R – The Lab-R-torian

Feature Engineering of the Gamma Region for Machine Learning Identification of Monoclonal Proteins

April 21, 2021 dtholmes@mail.ubc.ca

Background

In a previous post I was toying with better ways to integrate monoclonal proteins by deconvolution of monoclonal peaks. It turns out this is a hard problem. In any case, deep learning neural networks have become so mature that I wanted to see how easy it is to identify the presence of monoclonal proteins in the gamma region from the densitometric scan. This is not a new idea: it was first suggested in 1992 in J Clin Pathol by Kratzer et al. I have to say, that was a very cool idea but, as far as I can tell, it has not gotten much traction. The paper is mostly cited by review articles, though there are a handful of further attempts. The challenges of developing a commercial product are probably regulatory in nature. However, I am hopeful that it would not be too hard to make a decision support tool.

When Kratzer et al. tackled this problem, they were working on an Escom “IBM Compatible” computer with a 286 processor and 1 MB of RAM. Escom is a defunct German computer company that purchased Commodore after it began to flounder. With hardware that limited, it would have been important for them to engineer their features carefully. I expect that TensorFlow may not need much feature engineering at all, but I want to try the features they suggested for the gamma region.

They took the scan of the gamma region, baselined it, made it periodic, and applied the fast Fourier transform (FFT) to represent the scan in frequency space. This is a very good idea.

Some Basic Stuff and Sanity Checking

Let’s make a function that is a sine function for $-\ell \le t \le \ell$ and is otherwise 0.

f1 <- function(t, l) {
  if (t < -l | t > l) {
    res <- 0
  } else {
    res <- sin(2*pi * t)
  }
  return(res)
}
f1v <- Vectorize(f1)

And now let’s examine this function for $-10 < t < 10$ and $\ell = 1$

The FFT of this function can be computed with the R fft() function. The frequency spacing $\Delta f$ between adjacent FFT bins is calculated with the formula

\[\Delta f = \frac{1}{N \Delta t} \]
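As a minimal sketch of that calculation, the base R code below builds the frequency axis for the windowed sine. The sampling set-up (`N`, `a`, `b`) and the helper name `windowed_sine` are assumptions made for illustration, since the post does not show them; the spacing `(b - a) / N` follows the convention used in the later code.

```r
# Assumed sampling set-up (not shown in the post): N points on [a, b]
N <- 1000
a <- -10
b <- 10
t <- seq(a, b, length.out = N)

# Sine restricted to [-l, l] and zero elsewhere (vectorised equivalent of f1)
windowed_sine <- function(t, l) ifelse(abs(t) <= l, sin(2 * pi * t), 0)

y <- windowed_sine(t, l = 1)
z <- fft(y)

delta_t <- (b - a) / N            # approximate sample spacing
delta_f <- 1 / (N * delta_t)      # frequency spacing between FFT bins
f <- (seq_len(N) - 1) * delta_f   # frequency assigned to each bin

spectrum <- data.frame(f = f, absz = Mod(z))

# Location of the largest amplitude in the first half of the spectrum
half <- seq_len(N / 2)
peak_f <- spectrum$f[half][which.max(spectrum$absz[half])]
```

Only the first half of the bins is searched for the peak because the second half mirrors it for a real-valued signal.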

Zooming in a bit we have:

df %>%
  ggplot(aes(x = f, y = absz)) +
  geom_line() + 
  xlim(c(0,10))

This looks as it should: it is trying to be a delta function, namely $\delta(f - 1)$. The more periods we feed the FFT, the better this becomes:

df <- NULL
tmp <- data.frame(t = t, y = rep(NA,N), z = rep(NA,N), absz = rep(NA,N), f = rep(NA,N), nT = rep(NA,N))

for(i in 1:5) {
  tmp <- tmp %>%
    mutate(y = f1v(t, 2 * i)) %>%
    mutate(z = fft(y)) %>%
    mutate(absz = abs(z)) %>%
    mutate(deltaf = 1 / (N * ((b - a) / N))) %>%
    mutate(f = (1:N - 1) * deltaf) %>%
    mutate(nT = factor(paste(2 * i, "periods"), levels = paste(2 * i, "periods")))
  df <- bind_rows(df, tmp)
}

So as the number of periods increases, things improve, and by 10 periods we have a very sharp $\delta$ function.

df %>%
  group_by(nT) %>%
  ggplot(aes(x = f, y = absz)) +
  geom_line() +
  xlim(c(0,5)) +
  facet_wrap(~nT)

Applying FFT to a Gaussian

We can do the same thing to a Gaussian to mimic the gamma region of electrophoresis. To do this we just replicate the gamma-region curve, inverting every second copy in order to make the resulting function odd. Let’s just examine 5 periods for simplicity, which is 10 replications of the gamma region.

library(gridExtra)
N <- 1000
a <- 0
b <- 10
t <- seq(a, b, length.out = N)

f2 <- function(x) {
  remainder <- x - trunc(x)
  if (ceiling(x) %% 2 == 1) {
    res <- +exp(-25 * (remainder - 0.5) ^ 2)
  } else {
    res <- -exp(-25 * (remainder - 0.5) ^ 2)
  }
  return(res)
}

f2v <- Vectorize(f2)
df <- data.frame(t = t, y = rep(NA,N), z = rep(NA,N), absz = rep(NA,N), f = rep(NA,N))
df <- df %>%
  mutate(y = f2v(t)) %>%
  mutate(z = fft(y)) %>%
  mutate(absz = abs(z)) %>%
  mutate(deltaf = 1/(N*((b-a)/N))) %>%
  mutate(f = (1:N - 1)*deltaf)

p1 <- df %>%
  ggplot(aes(x = t, y = y)) +
  geom_line() +
  xlim(c(0,b)) 

p2 <- df %>%
  ggplot(aes(x = f, y = absz)) +
  geom_line() +
  xlim(c(0,b)) 

grid.arrange(p1, p2, ncol=2)

So, with a pure Gaussian, we more or less have 4 peaks in the FFT. If we now mimic a monoclonal protein by adding a second Gaussian, we can see that the FFT becomes more complex because more frequency components $e^{-\frac{2 \pi i}{N}kn}$ (that is, more values of the frequency index $k$) are required to characterize the concatenated periodic function.


N <- 10000
a <- 0
b <- 10
t <- seq(a, b, length.out = N)

f3 <- function(x){
  remainder <- x - trunc(x)
  if(ceiling(x) %% 2 == 1){
  res <- exp(-25*(remainder-0.5)^2) + exp(-150*(remainder-0.8)^2)
  } else {
  res <- -exp(-25*(remainder-0.5)^2) - exp(-150*(remainder-0.8)^2)
  }
  return(res)
}

f3v <- Vectorize(f3)
df <- data.frame(t = t, y = rep(NA,N), z = rep(NA,N), absz = rep(NA,N), f = rep(NA,N))
df <- df %>%
  mutate(y = f3v(t)) %>%
  mutate(z = fft(y)) %>%
  mutate(absz = abs(z)) %>%
  mutate(deltaf = 1/(N*((b-a)/N))) %>%
  mutate(f = (1:N - 1)*deltaf)

p1 <- df %>%
  ggplot(aes(x = t, y = y)) +
  geom_line() +
  xlim(c(0,b)) 

p2 <- df %>%
  ggplot(aes(x = f, y = absz)) +
  geom_line() +
  xlim(c(0,b)) 

grid.arrange(p1, p2, ncol=2)

The gamma region with a monoclonal results in an obviously more complex FFT.
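One way to put a number on "more complex" is to count how many FFT bins carry a non-negligible amplitude. The sketch below does this for the two synthetic shapes used above. The helper names (`periodic_odd`, `count_components`) and the 1% threshold are assumptions chosen for illustration, not something taken from Kratzer et al.

```r
N <- 10000
t <- seq(0, 10, length.out = N)

# Alternate-sign periodic extension of a shape defined on [0, 1],
# mirroring what f2 and f3 do
periodic_odd <- function(shape) {
  function(x) {
    r <- x - trunc(x)
    sgn <- ifelse(ceiling(x) %% 2 == 1, 1, -1)
    sgn * shape(r)
  }
}

polyclonal <- function(r) exp(-25 * (r - 0.5)^2)
with_spike <- function(r) polyclonal(r) + exp(-150 * (r - 0.8)^2)

# Number of bins (first half of the spectrum) whose amplitude exceeds
# a fraction of the largest amplitude
count_components <- function(y, rel_threshold = 0.01) {
  amp <- Mod(fft(y))[seq_len(length(y) / 2)]
  sum(amp > rel_threshold * max(amp))
}

n_plain <- count_components(periodic_odd(polyclonal)(t))
n_spike <- count_components(periodic_odd(with_spike)(t))
```

The narrow second Gaussian is broad in frequency space, so `n_spike` comes out larger than `n_plain`; a count like this could serve as one crude engineered feature.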

Apply to a Real Gamma Region

Here is an example densitometric scan. We can pull out the gamma region.

Then we can baseline the gamma region, replicate it 10 times and apply the FFT.

library(baseline)
N <- 1000
a <- 0
b <- 10
t <- seq(a, b, length.out = N)

gamma <- filter(real_data, t > 0.6)

bl <- baseline(matrix(gamma$y, nrow = 1), method = "modpolyfit", degree = 1)
gamma$y_corr <- bl@corrected[1,]
gamma$y_corr[gamma$y_corr < 0.01] <- 0
gamma$y_corr <- stats::filter(gamma$y_corr, rep(0.2,5), circular = TRUE) # force a little smoothing to get rid of cusps

#normalize the domain
gamma$t_corr <- (gamma$t - head(gamma$t,1))/(tail(gamma$t,1) - head(gamma$t,1))

gamma %>%
  ggplot(aes(x = t_corr, y = y_corr)) +
  geom_line()

# make a spline function
ep_fun <- splinefun(gamma$t_corr, gamma$y_corr)

f4 <- function(x){
  remainder <- x - trunc(x)
  if(ceiling(x) %% 2 == 1){
  res <- ep_fun(remainder)
  } else {
  res <- -ep_fun(remainder)
  }
  return(res)
}

f4v <- Vectorize(f4)

df <- data.frame(t = t, y = rep(NA,N), z = rep(NA,N), absz = rep(NA,N), f = rep(NA,N))
df <- df %>%
  mutate(y = f4v(t)) %>%
  mutate(z = fft(y)) %>%
  mutate(absz = abs(z)) %>%
  mutate(deltaf = 1/(N*((b-a)/N))) %>%
  mutate(f = (1:N - 1)*deltaf)

p1 <- df %>%
  ggplot(aes(x = t, y = y)) +
  geom_line() +
  xlim(c(0,b)) 

p2 <- df %>%
  ggplot(aes(x = f, y = absz)) +
  geom_line() +
  xlim(c(0,b)) 

grid.arrange(p1, p2, ncol=2)

And that is more or less the gist of things. The simpler the gamma region is, the fewer peaks there will be in the frequency domain. This permits recapitulation of what Kratzer et al. did in their 1992 paper with modern techniques. As I say, I am not sure it's necessary, but it might improve model performance. Of course, it would also seem possible to FFT the whole scan and reduce the dimensionality of the problem to about 30 frequencies and amplitudes.
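To make the dimensionality-reduction idea concrete, here is a minimal sketch of turning a curve into a fixed-length feature vector of FFT amplitudes. The function name `fft_features`, the single positive/negative replication, and the example curve are all assumptions for illustration; they are not the exact features used in the 1992 paper.

```r
# Hypothetical helper: amplitudes of the first k non-zero FFT frequencies
fft_features <- function(y, k = 30) {
  y_periodic <- c(y, -y)                        # one copy followed by its negative
  amp <- Mod(fft(y_periodic)) / length(y_periodic)
  amp[2:(k + 1)]                                # skip the zero-frequency term
}

# Synthetic stand-in for a baselined gamma region on a normalised domain
x <- seq(0, 1, length.out = 200)
curve_example <- exp(-25 * (x - 0.5)^2) + exp(-150 * (x - 0.8)^2)

features <- fft_features(curve_example, k = 30)
length(features)
```

Each scan then contributes the same number of inputs to a classifier regardless of how many points the densitometer recorded.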

Parting Thought

As the FFT discerns the frequencies of the function, so also:

“Do not consider his appearance or his height … The Lord does not look at the things people look at. People look at the outward appearance, but the Lord looks at the heart.”

1 Sam 16:7

A Deep Learning Classifier of New Testament Verse Authorship using the R Keras Package

April 7, 2021 dtholmes@mail.ubc.ca

Introduction

This is the first of what I am hoping are a number of posts on different machine learning classifiers. The subject matter is not lab medicine but the methodology applies to any similar project. For example, maybe you want to classify the text of a general internal medicine consult into its subspecialty based on the words used or perhaps you want to determine which IT tickets are likely high priority. Maybe you want to convert free text diagnoses into categorical diagnoses. Ultimately, the problem I want to tackle is text classification.

In any case, the book that I have been reading at home is Deep Learning with R by François Chollet with J.J. Allaire, which has many interesting and easy-to-follow examples. Since it’s on my mind, I thought a deep learning model would be a good place to start. But I did not want to just redo one of the examples from the book because the data sets are already cleansed and in that sense much of the heavy lifting is done. I wanted to start from a new data set and use the approach shown in section 3.5 but apply it to a new text classification problem. Because I want to follow the basic flow of the Reuters News Wire classifier, I need a similar natural language processing (NLP) multiclass text classification problem.

The problem I have chosen is one of authorship classification. Specifically, given any Greek verse taken from the New Testament, can I make a deep learning classifier that will identify the author of a verse that the classifier has never seen?

Data Cleansing

The text of the New Testament is available online from numerous sources. I downloaded it here and chose the Byzantine Textform 2005 file. The text has been cleansed, put to lower case and transliterated to English characters. There are several steps to get it to a simple dataframe which the following code achieves. The code makes a dataframe where each row is a verse.

library(tidyverse)
library(janitor)
library(keras)
library(knitr)

# Deals with carriage returns within verses and brings all verses to one row
versify <- function(text, book) {
  book_text <- tibble(reference = rep(NA,10000), verse = rep(NA,10000), book = book)
  chap_verse_pattern <- "[:space:]*[0-9]{1,3}[:]{1}[0-9]{1,3}[:space:]*"
  chap_verse_indices <- str_which(text, chap_verse_pattern) 
  chap_verse_indices <- c(1, chap_verse_indices)
  for(i in 1:(length(chap_verse_indices) - 1)){
    verse_text <- paste(text[chap_verse_indices[i] : (chap_verse_indices[i + 1] - 1)], collapse = " ")
    reference <- str_extract(verse_text, chap_verse_pattern) %>% str_trim
    verse_text <- str_replace(verse_text, chap_verse_pattern, "")
    book_text$verse[i] <- verse_text
    if(length(reference) == 0){
      book_text$reference[i] <- NA
    } else {
      book_text$reference[i] <- reference
    }
  }
  book_text <- book_text %>%
    filter(verse != "") %>%
    na.omit()
  return(book_text)
}

# read in the books and get the in the expected order just for readability
books <- list.files("Greek/", pattern = ".ASC")
book_order <- c("MT05.ASC",
                "MR05.ASC",
                "LU05.ASC",
                "JOH05.ASC",
                "AC05.ASC",
                "RO05.ASC",
                "1CO05.ASC",
                "2CO05.ASC",
                "GA05.ASC",
                "EPH05.ASC",
                "PHP05.ASC",
                "COL05.ASC",
                "1TH05.ASC",
                "2TH05.ASC",
                "1TI05.ASC",
                "2TI05.ASC",
                "TIT05.ASC",
                "PHM05.ASC",
                "HEB05.ASC",
                "JAS05.ASC",
                "1PE05.ASC",
                "2PE05.ASC",
                "1JO05.ASC",
                "2JO05.ASC",
                "3JO05.ASC",
                "JUDE05.ASC",
                "RE05.ASC")
# sort book names
book_names <- sort(factor(books, levels = book_order))

# set up author vector
authors <- c("matthew",
             "mark",
             "luke",
             "john",
             "luke",
             rep("paul", 13),
             "unknown",
             "james",
             rep("peter", 2),
             rep("john", 3),
             "jude",
             "john")

# book author dataframe
authors <- tibble(book = book_names, author = authors)

# prepare empty tibble to store text
nt_frame <- tibble(reference = character(), verse = character(), book = character())

for(i in 1:length(book_names)){
  tmp <- readLines(con = paste0("Greek/",as.character(books[i])))
  tmp <- versify(tmp, books[i])
  nt_frame <- bind_rows(nt_frame,tmp)
}

nt_frame <- left_join(nt_frame, authors, by = "book")

# force correct display order
nt_frame$book <- factor(nt_frame$book, levels = book_names)
nt_frame <- arrange(nt_frame, book)

Now that this wrangling is complete, we have a tibble that looks like this:

# show tibble format
kable(head(nt_frame, 10))

reference	verse	book	author
1:1	biblov genesewv ihsou cristou uiou dauid uiou abraam	MT05.ASC	matthew
1:2	abraam egennhsen ton isaak isaak de egennhsen ton iakwb iakwb de egennhsen ton ioudan kai touv adelfouv autou	MT05.ASC	matthew
1:3	ioudav de egennhsen ton farev kai ton zara ek thv yamar farev de egennhsen ton esrwm esrwm de egennhsen ton aram	MT05.ASC	matthew
1:4	aram de egennhsen ton aminadab aminadab de egennhsen ton naasswn naasswn de egennhsen ton salmwn	MT05.ASC	matthew
1:5	salmwn de egennhsen ton booz ek thv racab booz de egennhsen ton wbhd ek thv rouy wbhd de egennhsen ton iessai	MT05.ASC	matthew
1:6	iessai de egennhsen ton dauid ton basilea dauid de o basileuv egennhsen ton solomwna ek thv tou ouriou	MT05.ASC	matthew
1:7	solomwn de egennhsen ton roboam roboam de egennhsen ton abia abia de egennhsen ton asa	MT05.ASC	matthew
1:8	asa de egennhsen ton iwsafat iwsafat de egennhsen ton iwram iwram de egennhsen ton ozian	MT05.ASC	matthew
1:9	oziav de egennhsen ton iwayam iwayam de egennhsen ton acaz acaz de egennhsen ton ezekian	MT05.ASC	matthew
1:10	ezekiav de egennhsen ton manassh manasshv de egennhsen ton amwn amwn de egennhsen ton iwsian	MT05.ASC	matthew

We should get verse counts that match what is expected, which we do.

# sanity check the verse counts by book
verse_counts <- nt_frame %>%
  group_by(book) %>%
  summarise(counts = n())
kable(verse_counts)

book	counts
MT05.ASC	1070
MR05.ASC	677
LU05.ASC	1149
JOH05.ASC	878
AC05.ASC	1003
RO05.ASC	432
1CO05.ASC	436
2CO05.ASC	256
GA05.ASC	148
EPH05.ASC	154
PHP05.ASC	103
COL05.ASC	94
1TH05.ASC	88
2TH05.ASC	46
1TI05.ASC	112
2TI05.ASC	82
TIT05.ASC	45
PHM05.ASC	24
HEB05.ASC	302
JAS05.ASC	107
1PE05.ASC	104
2PE05.ASC	60
1JO05.ASC	104
2JO05.ASC	12
3JO05.ASC	13
JUDE05.ASC	24
RE05.ASC	403

And we can check the unique word count

# check unique words
word_list <- paste(nt_frame$verse, collapse = " ") %>%
  str_split(" ") %>%
  unlist %>%
  str_replace("[:punct:]","") %>%
  tolower()
length(unique(word_list))

## [1] 17156

Normally at this point, we might remove stop words and then stem and lemmatize the text (i.e., get rid of useless little words and suffixes that cause words of the same meaning to look different). This would be more important in more traditional learning classifiers but is likely less important when using Keras and TensorFlow. If I were running this classifier on the English text of the KJV, for example, I would run it with and without such a process and gauge the performance change. There are numerous NLP packages specifically dedicated to this task. I am going to skip it here. This process is, of course, highly language-dependent.

The other thing I need to do is make the author factor column numbered 0-8 instead of 1-9, because R is going to be calling Python code and Python starts counting at 0. This bug took me a while to sort out.

nt_frame <- nt_frame %>%
  mutate(author_factor = as.numeric(as.factor(nt_frame$author)) - 1) %>% #pythonic 
  mutate(verse_number  = 1:nrow(nt_frame))
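The same conversion can be sketched in base R on a toy vector, together with a lookup table that records which code belongs to which author. The toy data and the names `author_levels` and `key` are assumptions for illustration only.

```r
authors <- c("paul", "luke", "john", "paul", "jude")

# Factor levels are sorted alphabetically, so the codes are too
author_levels <- sort(unique(authors))
author_factor <- as.integer(factor(authors, levels = author_levels)) - 1L

# Lookup table from zero-based code back to author name
key <- data.frame(author = author_levels,
                  code = seq_along(author_levels) - 1L)
```

Keeping a key like this makes it easier to relabel a confusion matrix later without guessing which number is which author.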

Now we will make a tokenizer, that is, a function to convert words to integers, and we will limit the model to the top 15000 of the 17156 unique words found in the text.

max_features <- 15000
text <- nt_frame$verse
tokenizer <- text_tokenizer(num_words = max_features) %>%
  fit_text_tokenizer(text)
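To show the idea behind a frequency-ranked tokenizer without needing Keras, here is a base R sketch on three toy verses. It only mimics the concept (rank words by frequency, keep the most common ones, map words to integer ranks); it is not a reimplementation of `text_tokenizer()`, whose exact indexing conventions may differ. The names `word_index`, `max_words` and `encode` are assumptions.

```r
verses <- c("kai eipen o ihsouv",
            "o de ihsouv eipen autw",
            "kai o logov")

# Rank every word by how often it occurs (rank 1 = most frequent)
words <- unlist(strsplit(verses, " ", fixed = TRUE))
freq <- sort(table(words), decreasing = TRUE)
word_index <- setNames(seq_along(freq), names(freq))

# Keep only the most frequent words
max_words <- 4
kept <- word_index[word_index <= max_words]

# Convert a verse to the integer ranks of its retained words
encode <- function(verse) {
  w <- strsplit(verse, " ", fixed = TRUE)[[1]]
  unname(kept[w[w %in% names(kept)]])
}

encode(verses[3])
```

Words outside the retained vocabulary are simply dropped, which is why capping the vocabulary at 15000 of 17156 words discards only the rarest terms.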

Now we need to split the text randomly into training and testing sets in an 80:20 split.

set.seed(316)
# Select random indices for training using 80% of the data. Notice that these are random!
training_id <- sample.int(nrow(nt_frame), size = nrow(nt_frame)*0.8)
# for reference separate training from testing data
train_data <- nt_frame[training_id,]
test_data <- nt_frame[-training_id,]

The data are very imbalanced, that is, there are authors (like Jude and James) who have very few verses ascribed to them and there are others (like Paul and Luke) who have many verses. For this reason, we should sanity check our training and testing data to make sure that we have sampled about 80% of each author's verses. There are specific tools for sampling the same fraction from every class, a process referred to as stratified sampling.

train_n <- table(train_data$author_factor)
test_n <- table(test_data$author_factor)
train_props <- round(train_n/(train_n + test_n),2)
train_props

## 
##    0    1    2    3    4    5    6    7    8 
## 0.78 0.79 0.67 0.81 0.81 0.81 0.79 0.80 0.81

We can see that we have a problem with author 2 (Jude), who has only 24 verses. This is probably not going to matter much, but we can try balanced (stratified) sampling, which does give better proportions.

# do balanced sampling
library(splitstackshape)
# get the balanced sample
train_data <- stratified(nt_frame, "author_factor", .8)
# randomize the sample
train_data <- train_data[sample(nrow(train_data)),]
training_id <- train_data$verse_number

# the test indices are therefore
testing_id <- which(!(nt_frame$verse_number %in% training_id))

# randomize the sample
test_data <- nt_frame[testing_id,]
test_data <- test_data[sample(nrow(test_data)),]

train_n <- table(train_data$author_factor)
test_n <- table(test_data$author_factor)

train_props <- round(train_n/(train_n + test_n),2)
train_props

## 
##    0    1    2    3    4    5    6    7    8 
## 0.80 0.80 0.79 0.80 0.80 0.80 0.80 0.80 0.80
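For readers who prefer not to add a package, a stratified split can also be sketched in base R by sampling within each class separately. The toy labels and variable names below are assumptions for illustration; this is an alternative sketch, not what `stratified()` does internally.

```r
set.seed(316)

# Toy imbalanced labels: 100, 25 and 10 items per class
labels <- rep(c("a", "b", "c"), times = c(100, 25, 10))

# Row indices grouped by class
idx_by_class <- split(seq_along(labels), labels)

# Draw 80% of the indices within every class
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(labels), train_idx)

# Proportion of each class that landed in the training set
train_props <- as.numeric(table(labels[train_idx]) / table(labels))
```

Indexing with `sample.int()` rather than calling `sample()` on the index vector avoids the well-known surprise when a class has a single member.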

Now we can vectorize the data, that is, convert each verse into a binary vector that flags which of the top words it contains, and one-hot encode the author labels.

# Create the training and testing x data
x_train <- texts_to_matrix(tokenizer, text[training_id], mode = "binary")
x_test <- texts_to_matrix(tokenizer, text[-training_id], mode = "binary")

# Set the training and testing y data and then one hot encode them
train_labels <- nt_frame$author_factor[training_id]
y_train <- to_categorical(train_labels)

test_labels <- nt_frame$author_factor[-training_id]
y_test <- to_categorical(test_labels)
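The two encodings can be illustrated in base R on a tiny example. This is a simplified sketch of the idea only: the toy sequences, `vocab_size` and `n_classes` are assumptions, and the matrices Keras produces may be laid out slightly differently (for instance with an extra reserved column).

```r
# Three toy "verses" already converted to word indices
sequences <- list(c(1, 3), c(2, 3, 3), c(4))
vocab_size <- 4

# Binary bag-of-words: a 1 in column j means word j occurs in the verse
multi_hot <- t(vapply(sequences, function(s) {
  v <- numeric(vocab_size)
  v[unique(s)] <- 1
  v
}, numeric(vocab_size)))

# One-hot labels: zero-based class k becomes a 1 in column k + 1
labels <- c(0, 2, 1)
n_classes <- 3
one_hot <- diag(n_classes)[labels + 1, , drop = FALSE]
```

Note that the binary encoding discards both word order and repeat counts, so the classifier sees only which words are present.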

We can satisfy ourselves that the training data are in random order:

kable(head(nt_frame[training_id,], 10))

reference	verse	book	author	author_factor	verse_number
16:6	all oti tauta lelalhka umin h luph peplhrwken umwn thn kardian	JOH05.ASC	john	1	3584
2:20	all ecw kata sou oti afeiv thn gunaika sou iezabel h legei eauthn profhtin kai didaskei kai plana touv emouv doulouv porneusai kai fagein eidwloyuta	RE05.ASC	john	1	7563
21:23	kai elyonti autw eiv to ieron proshlyon autw didaskonti oi arciereiv kai oi presbuteroi tou laou legontev en poia exousia tauta poieiv kai tiv soi edwken thn exousian tauthn	MT05.ASC	matthew	5	705
5:1	dikaiwyentev oun ek pistewv eirhnhn ecomen prov ton yeon dia tou kuriou hmwn ihsou cristou	RO05.ASC	paul	6	4895
12:29	h pwv dunatai tiv eiselyein eiv thn oikian tou iscurou kai ta skeuh autou diarpasai ean mh prwton dhsh ton iscuron kai tote thn oikian autou diarpasei	MT05.ASC	matthew	5	374
4:24	alla kai di hmav oiv mellei logizesyai toiv pisteuousin epi ton egeiranta ihsoun ton kurion hmwn ek nekrwn	RO05.ASC	paul	6	4893
27:31	eipen o paulov tw ekatontarch kai toiv stratiwtaiv ean mh outoi meinwsin en tw ploiw umeiv swyhnai ou dunasye	AC05.ASC	luke	3	4734
1:25	kai hrwthsan auton kai eipon autw ti oun baptizeiv ei su ouk ei o cristov oute hliav oute o profhthv	JOH05.ASC	john	1	2921
3:6	kai exelyontev oi farisaioi euyewv meta twn hrwdianwn sumboulion epoioun kat autou opwv auton apoleswsin	MR05.ASC	mark	4	1149
8:4	ei men gar hn epi ghv oud an hn iereuv ontwn twn ierewn twn prosferontwn kata ton nomon ta dwra	HEB05.ASC	unknown	8	6930

Now we can build a basic model:

# Build a basic model
model <- keras_model_sequential() %>% 
  layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% 
  layer_dense(units = 64, activation = "relu") %>% 
  layer_dense(units = 9, activation = "softmax")

model %>% compile(
  optimizer = "rmsprop",
  loss = "categorical_crossentropy",
  metrics = c("accuracy")
)

Next we pull out validation data, again in an 80:20 split.

# pull out some validation data (the first 20% of the shuffled training rows)
val_indices <- 1:floor(nrow(train_data) * 0.2)
x_val <- x_train[val_indices,]
partial_x_train <- x_train[-val_indices,]
y_val <- y_train[val_indices,]
partial_y_train <- y_train[-val_indices,]
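A short aside on building index ranges like this in R: the colon operator binds more tightly than multiplication, so where the parentheses go changes the result. The toy value `n` below is an assumption for illustration.

```r
n <- 10

# Parsed as (1:n) * 0.2, giving n fractional values rather than a shorter range
idx_scaled <- 1:n * 0.2

# The first 20% of the rows as whole-number indices
idx_range <- 1:floor(n * 0.2)
```

Fractional indices are silently truncated when used for subsetting, so a misplaced parenthesis here tends to produce repeated rows rather than an error.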

Now we run the model:

history <- model %>% keras::fit(
  partial_x_train, # train in the non-validation training data
  partial_y_train,
  epochs = 5,
  batch_size = 256,
  validation_data = list(x_val, y_val)
)

results <- model %>% 
  keras::evaluate(x_test, y_test)
results

##      loss  accuracy 
## 1.0402801 0.6382576
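To judge whether roughly 64% accuracy is any good, it helps to compare it with the accuracy of always guessing the most common author. The sketch below computes that baseline from the per-book verse counts tabulated earlier, summed by author; the variable names are assumptions.

```r
# Verse counts per author, summed from the per-book table above
author_counts <- c(james = 107, john = 1410, jude = 24, luke = 2152,
                   mark = 677, matthew = 1070, paul = 2020,
                   peter = 164, unknown = 302)

# Accuracy of a classifier that always predicts the largest class
baseline_accuracy <- max(author_counts) / sum(author_counts)
round(baseline_accuracy, 3)
```

On these counts the majority-class baseline is a little over a quarter, so the basic model is doing considerably better than guessing the most prolific author every time.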

plot(history) +
  geom_line()

We can show the model performance graphically:

library(corrplot)
pred <- model %>%
  predict_classes(x_test) %>%
  factor(0:8)

res_tab <- table(Pred = pred, Act = test_labels)
res_prop <- prop.table(res_tab,2)

author_key <- tibble(author = nt_frame$author, code = nt_frame$author_factor) %>%
  unique %>%
  arrange(code)

colnames(res_prop) <- author_key$author
rownames(res_prop) <- author_key$author
corrplot(res_prop,is.corr = FALSE,
         method = "circle", addCoef.col = "lightblue", number.cex = 0.7)
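As far as I am aware, `predict_classes()` is not available in some later releases of the keras R package. If it is missing in your installation, the predicted class can be recovered from the matrix of class probabilities by taking the position of the largest value in each row. The toy probability matrix below is an assumption standing in for the output of a prediction call.

```r
# Toy stand-in for a matrix of predicted class probabilities (rows = verses)
probs <- matrix(c(0.1, 0.7, 0.2,
                  0.6, 0.3, 0.1),
                nrow = 2, byrow = TRUE)

# Column of the largest probability per row, shifted to zero-based labels
pred <- apply(probs, 1, which.max) - 1
pred
```

The subtraction of 1 keeps the predictions on the same 0-8 scale as the author codes used for training.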

Results are not great because many authors are being misclassified as Paul or Luke. This is likely due to class imbalance among the authors, so we can address the imbalance with weights and with dropout layers as suggested in this very informative tutorial from Dr. Bharatendra Rai.

model <- keras_model_sequential() %>% 
  layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 32, activation = 'relu') %>% 
  layer_dropout(rate = 0.2) %>%    
  layer_dense(units = 9, activation = "softmax")

model %>% 
  compile(loss = 'categorical_crossentropy',
          optimizer = 'rmsprop',
          metrics = 'accuracy')

history <- model %>% keras::fit(
  partial_x_train,
  partial_y_train,
  epochs = 5,
  batch_size = 256,
  validation_data = list(x_val, y_val),
  class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1))

results <- model %>% keras::evaluate(x_test, y_test)
results

##      loss  accuracy 
## 1.4781048 0.5997475
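
The post does not say how the class_weight values were obtained. One common heuristic (an assumption here, not necessarily what was done above) is to weight each class by the size of the largest class divided by its own size, so the largest class gets a weight of 1. A minimal base-R sketch with a made-up label vector:

```r
# Hypothetical labels standing in for the training labels (the post uses codes 0-8)
labels <- c(rep(0, 5), rep(1, 60), rep(2, 2), rep(3, 100))

counts <- table(labels)

# Inverse-frequency weights, scaled so the most common class has weight 1
weights <- round(max(counts) / counts, 1)

# keras expects a named list keyed by the class code as a string
class_weight <- as.list(setNames(as.numeric(weights), names(counts)))
str(class_weight)
```

The resulting list has the same shape as the class_weight argument passed to fit() above.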

pred <- model %>%
  predict_classes(x_test) %>%
  factor(0:8)
res_tab <- table(Pred = pred, Act = test_labels)
res_tab

##     Act
## Pred   0   1   2   3   4   5   6   7   8
##    0   6   1   0   8   0   4  19   2   1
##    1   1 211   0  34  18  26  19   2   3
##    2   0   1   0   2   0   0   1   0   0
##    3   1  21   1 261  31  42  38   4   8
##    4   0  26   0  42  58  39   5   0   4
##    5   1   7   2  52  23  97   5   1   2
##    6  10  11   1  16   1   4 286  20  11
##    7   0   0   0   1   0   0   0   0   0
##    8   2   4   1  14   4   2  31   4  31

res_prop <- prop.table(res_tab,2)
colnames(res_prop) <- author_key$author
rownames(res_prop) <- author_key$author

What we get looks a little better with more counts on the diagonal.

corrplot(res_prop,is.corr = FALSE,
         method = "circle", addCoef.col = "lightblue", number.cex = 0.7)

plot of chunk unnamed-chunk-19

The model is jumpy on the small books, probably because they are undersampled. This means that k-fold cross validation should help us assess model performance. I am not sure whether I should try to have balanced sampling in the folds, but I am not going to worry about that at the moment.

build_model <- function() {
  model <- keras_model_sequential() %>% 
    layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% 
    layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 32, activation = 'relu') %>% 
    layer_dropout(rate = 0.2) %>%    
    layer_dense(units = 9, activation = "softmax")

  model %>% 
    compile(loss = 'categorical_crossentropy',
            optimizer = 'rmsprop',
            metrics = 'accuracy')
}

Run the k-fold cross-validation.

acc_histories <- NULL
loss_histories <- NULL
k <- 4
indices <- sample(1:nrow(train_data)) 
folds <- cut(1:length(indices), breaks = k, labels = FALSE) 
num_epochs <- 50
all_scores <- c()
proportions_list <- list()

for (i in 1:k) {
  cat("processing fold #", i, "\n")
  # Prepare the validation data: rows assigned to fold i
  # (index through the shuffled `indices` so the folds are random, not contiguous)
  val_indices <- indices[folds == i]
  x_val_kfold <- x_train[val_indices,]
  y_val_kfold <- y_train[val_indices,]

  # Prepare the training data: data from all other partitions
  #partial_train_data <- train_data[-val_indices,]
  x_train_kfold <- x_train[-val_indices,]
  y_train_kfold <- y_train[-val_indices,]

  # Build the Keras model (already compiled)
  model <- build_model()

  # Train the model (in silent mode, verbose=0)
  history <- model %>% fit(x_train_kfold, y_train_kfold,
                epochs = num_epochs,
                batch_size = 256,
                validation_data = list(x_val_kfold, y_val_kfold),
                verbose = 0,
  class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1))

  # Evaluate the model on the fold's validation data
  results <- model %>% keras::evaluate(x_val_kfold, y_val_kfold)
  all_scores <- c(all_scores, results["accuracy"])

  pred <- model %>%
  predict_classes(x_val_kfold) %>%
  factor(0:8)

  res_tab <- table(Pred = pred, Act = train_data$author_factor[val_indices])
  res_prop <- prop.table(res_tab,2)
  proportions_list[[i]] <- res_prop

  acc_history <- history$metrics$val_accuracy
  acc_histories <- rbind(acc_histories, acc_history)
  # store the validation loss and append it to loss_histories (not to acc_history)
  loss_history <- history$metrics$val_loss
  loss_histories <- rbind(loss_histories, loss_history)
}  

all_scores
mean(all_scores)
average_acc_history <- data.frame(
  epoch = seq_len(ncol(acc_histories)),
  validation_acc = apply(acc_histories, 2, mean)
)
average_loss_history <- data.frame(
  epoch = seq_len(ncol(loss_histories)),
  validation_loss = apply(loss_histories, 2, mean)
)
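
On the question of balanced sampling in the folds: one option is stratified fold assignment, where each class is dealt out evenly across the k folds. The following is only a base-R sketch on made-up labels; the helper stratified_folds() is hypothetical and is not part of keras or of the code above.

```r
set.seed(42)

# Hypothetical imbalanced labels; in the post this role is played by train_data$author_factor
y <- factor(c(rep("a", 40), rep("b", 12), rep("c", 8)))
k <- 4

# Within each class, shuffle the row numbers and deal them out to folds 1..k in turn
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cls in levels(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]  # safe even when a class has a single row
    folds[idx] <- rep_len(seq_len(k), length(idx))
  }
  folds
}

folds <- stratified_folds(y, k)
table(fold = folds, class = y)
```

Each fold then holds roughly the same share of every class, so a small class cannot end up entirely inside one validation fold.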

Validation accuracy improves modestly with more epochs, but the model definitely overfits the training data (training accuracy gets into the high 90s). This is a bit of a conundrum for which I do not know the answer (those who know, please comment): I can overfit the model to make gains on the validation set, and these do improve performance on the test set, but I expect that this improvement is happening in some non-generalizable way.

library(ggplot2)
ggplot(average_acc_history, aes(x = epoch, y = validation_acc)) + geom_line()

plot of chunk unnamed-chunk-22

Likewise loss slowly declines over many epochs but the model overfits.

ggplot(average_loss_history, aes(x = epoch, y = validation_loss)) + geom_line()

plot of chunk unnamed-chunk-23

In any case, this is the model performance rerunning the k-fold cross validation with 5 epochs.

Final Outcome

Satisfied enough that 5 epochs should be OK, I can run the model on the whole training set and look at its performance on the testing set.

model <- keras_model_sequential() %>% 
  layer_dense(units = 64, activation = "relu", input_shape = c(max_features)) %>% 
  layer_dropout(rate = 0.4) %>% 
  layer_dense(units = 32, activation = 'relu') %>% 
  layer_dropout(rate = 0.2) %>%    
  layer_dense(units = 9, activation = "softmax")

model %>% 
  compile(loss = 'categorical_crossentropy',
          optimizer = 'rmsprop',
          metrics = 'accuracy')

history <- model %>% keras::fit(
  x_train,
  y_train,
  epochs = 5,
  batch_size = 256,
  #validation_data = list(x_val, y_val),
  class_weight = list("0" = 20.1, "1" = 1.5, "2" = 89.7, "3" = 1.0, "4" = 3.2, "5" = 2.0, "6" = 1.1, "7" = 13.1, "8" = 7.1))

results <- model %>% keras::evaluate(x_test, y_test)
results

##      loss  accuracy 
## 1.4544461 0.5568182

pred <- model %>%
  predict_classes(x_test) %>%
  factor(0:8)
res_tab <- table(Pred = pred, Act = test_labels)
res_tab

##     Act
## Pred   0   1   2   3   4   5   6   7   8
##    0  10   8   0  15   3   3  45   3   4
##    1   1 233   0  45  21  31  38   2  10
##    2   0   0   0   0   0   0   1   0   0
##    3   2   6   0 198  13  25  17   3   3
##    4   0  19   1  77  64  57   7   0   1
##    5   0   8   0  57  29  90   7   1   1
##    6   6   4   0  14   1   5 250  19   7
##    7   2   0   3   5   0   0   8   3   0
##    8   0   4   1  19   4   3  31   2  34
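
The overall accuracy and the per-class recall can be recomputed directly from this table. A base-R sketch with the counts typed in by hand from the printed output; the object names cm and recall are just for illustration.

```r
# Counts copied from the printed table: rows are predictions, columns are actual classes
cm <- matrix(c(
  10,   8, 0,  15,  3,  3,  45,  3,  4,
   1, 233, 0,  45, 21, 31,  38,  2, 10,
   0,   0, 0,   0,  0,  0,   1,  0,  0,
   2,   6, 0, 198, 13, 25,  17,  3,  3,
   0,  19, 1,  77, 64, 57,   7,  0,  1,
   0,   8, 0,  57, 29, 90,   7,  1,  1,
   6,   4, 0,  14,  1,  5, 250, 19,  7,
   2,   0, 3,   5,  0,  0,   8,  3,  0,
   0,   4, 1,  19,  4,  3,  31,  2, 34),
  nrow = 9, byrow = TRUE,
  dimnames = list(Pred = as.character(0:8), Act = as.character(0:8)))

accuracy <- sum(diag(cm)) / sum(cm)  # fraction of verses on the diagonal
recall <- diag(cm) / colSums(cm)     # per class: fraction of actual verses predicted correctly

round(accuracy, 4)
round(recall, 2)
```

The accuracy agrees with the value returned by evaluate(), and recall is the same quantity as the diagonal of prop.table(res_tab, 2), which is what the corrplot displays.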

res_prop <- prop.table(res_tab,2)
colnames(res_prop) <- author_key$author
rownames(res_prop) <- author_key$author
corrplot(res_prop,is.corr = FALSE,
         method = "circle", addCoef.col = "lightblue", number.cex = 0.7)

plot of chunk unnamed-chunk-26

Some interesting findings:

John seems to be the easiest to classify. This fits well with his unique authorship style.
The synoptic gospels are easily misclassified among one another. Again, this fits with the overlap of stories, parables and other content.
Hebrews looks more like Hebrews than it looks like Paul. This fits with the perspective that Paul is not the author of Hebrews.
Poor James, Jude and Peter: just not enough verses to get proper classification. I am sure there are ways to address this kind of imbalance, were classifying Jude correctly a very important thing to do.

I think I am going to stop trying to improve this because it is not a real problem but I hope that someone else can recycle some code for a real-life problem. I would be interested in comments on how to get improved classification of small classes.

Parting Easter Thought

ouk estin wde hgeryh gar kaywv eipen deute idete ton topon opou ekeito o kuriov (“He is not here: for he is risen, as he said. Come, see the place where the Lord lay.”), Matthew 28:6

Calculate all the CVs of all the QC Levels of all the Methods of all the Instruments at all the Sites all at once … with Sunquest LIS and dplyr

February 4, 2020November 24, 2020 dtholmes@mail.ubc.ca

Background

As part of our lab accreditation requirements, we have to provide measurement uncertainty estimates for all tests at all hospital sites. As you might imagine, with thousands of testcodes in Sunquest LIS, getting all the coefficients of variation (CVs) represents a daunting task for the quality technologist to accomplish. As it turns out, by capturing the ssh session in a .txt file, you can use R’s dplyr package to do this all in a few lines of code.

Getting the Raw Data

You need to get the raw data from Sunquest. You can capture the telnet session (yes… older versions of Sunquest use telnet and pass patient information and user passwords unencrypted across the hospital network o_O) or the ssh session to a file using Esker SmarTerm, which Sunquest packages in their product and refers to as “roll-n-scroll”. People disparage SmarTerm as an old “DOS tool”, whereas Sunquest is in fact hosted on the AIX operating system. SmarTerm access to Sunquest is a gazillion times faster than the GUI and permits us to capture the raw QC data we need. To capture the session, select from the dropdown menu as shown here:

SQ Screenshot1

If you are using Mac OS or Linux OS, you can also capture the ssh session by connecting from the terminal and using tee to dump the session to a file.

ssh user@serverIPaddress | tee captured_session.txt

Once you have connected, use the QC function and select output printer 0 (meaning the screen) and make these selections, changing the dates as appropriate:

SQ Screenshot1

If you make no selections at all for any of:

TEST:
WORKSHEET:
METHOD:
CONTROL:
SHIFT #:
TECH:
TESTS REQUESTED:

then you will extract everything, which is what you want and which will make for a very big .txt file. There will be a delay and then thousands of QC results will dump to the screen and to your file. When this is complete, end your SmarTerm or ssh or telnet (cringe) session. I saved my text dump as raw_SQ8.txt.

Getting it into R and parsing it

Your data will come out as a fixed-width file with no delimiters. It will also have a bunch of junk at the bottom and top of the file detailing your commands from the start and end of the session. These need to be discarded. I just used grep() to find all the lines with the appropriate date pattern. After reading it in, because I am lazy, I wrote it back out and read it in again with read.fwf().

library(tidyverse)
library(lubridate)
library(knitr)

# Note to my friend SK - yes... this is mostly in base-R... 

# create a connection
con <- file(file.path("raw_SQ8.txt"))
raw.qc.data <- readLines(con)
close(con)
#find good rows
good.data <- grep("[0-9]{2}(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[2][0][0-9]{6}",raw.qc.data)
raw.qc.data <- raw.qc.data[good.data]
#remove a screwball encoding character
raw.qc.data[1] <- substr(raw.qc.data[1],6,nchar(raw.qc.data[1]))
con <- file("temp.txt")
#rewrite the file with no garbage in it.
writeLines(raw.qc.data, con)
close(con)
raw.qc.data <- read.fwf("temp.txt",c(6,6,6,20,13,6,2,15,100))
file.remove("temp.txt")
names(raw.qc.data) <- c("test.code","instr.code","qc.name","qc.expire",
                        "date.performed","tech.code","shift",
                        "result","modifier")
raw.qc.data <- data.frame(lapply(raw.qc.data, trimws))
raw.qc.data$result <- as.numeric(as.character(raw.qc.data$result))
raw.qc.data$date.performed <- dmy_hm(raw.qc.data$date.performed)
raw.qc.data$tech.code <- as.numeric(factor(raw.qc.data$tech.code)) #anonymize tech codes as integer factor codes (explicit factor() is needed in R >= 4.0)
raw.qc.data <- arrange(raw.qc.data, instr.code, test.code)
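
As an aside, the round trip through temp.txt can be avoided, because read.fwf() also accepts a connection. The sketch below uses made-up lines and made-up column widths (they are not the real Sunquest layout) purely to show the idea.

```r
# Hypothetical captured session: junk at the top and bottom, fixed-width rows in between
lines <- c("junk header line",
           "BCL   JBGAS 122",
           "BGLUC JBGAS 1.5",
           "logout")

# Keep only the rows that look like data (the post uses a date pattern for this)
keep <- grepl("^[A-Z]+ +JBGAS ", lines)

# Parse straight from memory via a text connection instead of a temporary file
con <- textConnection(lines[keep])
qc <- read.fwf(con, widths = c(6, 6, 3),
               col.names = c("test.code", "instr.code", "result"),
               strip.white = TRUE)
close(con)
qc
```

With strip.white = TRUE the padding is removed from the character columns, which also stands in for the trimws() step.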

Now that all the data munging is done, we can examine the data:

test.code	instr.code	qc.name	qc.expire	date.performed	tech.code	shift	result	modifier
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-15 09:17:00	68	2	122	NA
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-15 20:51:00	68	3	122	NA
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-15 21:47:00	68	3	122	NA
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-15 21:50:00	68	3	122	NA
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-17 07:10:00	15	1	122	NA
BCL	JBGAS	RAD1	R0173 EXP MAR 2021	2019-11-17 07:11:00	15	1	122	NA

And finally, we can make the dplyr magic happen and discard results for which the counts are too small, which I have chosen to be <20:

raw.qc.data %>% dplyr::filter(!is.na(result)) %>%
  group_by(instr.code,test.code,qc.name,qc.expire) %>%
  summarise(median = median(result),
            IQR = IQR(result),
            mean = mean(result),
            SD = sd(result),
            min = min(result),
            max = max(result),
            CV = round(sd(result, na.rm = TRUE)/mean(result, na.rm = TRUE)*100,2),
            count = n()) %>%
  filter(count >= 20) %>%
  arrange(instr.code, test.code, median) -> summary.table

Which gives us output like this:

head(summary.table)

instr.code	test.code	qc.name	qc.expire	median	IQR	mean	SD	min	max	CV	count
JBGAS	BCL	RAD3	R0141 EXP SEP 2017	65.0	1.000	65.145454	0.6503043	63.0	66.0	1.00	55
JBGAS	BCL	RAD2	R0175 EXP MAR 2021	97.0	0.000	97.128205	0.3364820	97.0	98.0	0.35	78
JBGAS	BCL	RAD1	R0173 EXP MAR 2021	122.0	0.000	122.122807	0.5691527	121.0	124.0	0.47	57
JBGAS	BGLUC	RAD1	R0173 EXP MAR 2021	1.5	0.000	1.507017	0.0257713	1.5	1.6	1.71	57
JBGAS	BGLUC	RAD2	R0175 EXP MAR 2021	5.6	0.075	5.585897	0.0639081	5.4	5.7	1.14	78
JBGAS	BGLUC	RAD3	R0141 EXP SEP 2017	13.7	0.100	13.763636	0.1310409	13.4	14.1	0.95	55
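
As a quick sanity check, the CV column is just SD divided by mean, times 100, rounded to two decimals. The snippet below redoes that arithmetic for the first two rows, with the values copied by hand from the table above.

```r
# Mean and SD copied from the first two rows of the summary table
means <- c(65.145454, 97.128205)
sds   <- c(0.6503043, 0.3364820)

# Coefficient of variation in percent, rounded as in the summarise() call
cv <- round(sds / means * 100, 2)
cv
```

These agree with the CV values of 1.00 and 0.35 shown in the table.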

This permits us to toss out results with low counts. But what about handling outliers? Well, we can calculate the z-scores of the raw data by joining the mean and SD results back to the raw data.

raw.qc.data %>%
  left_join(select(summary.table,c(instr.code:qc.expire, mean, SD)),
             by = c("test.code","instr.code", "qc.name", "qc.expire")) %>%
  mutate(z.score = (result - mean)/SD) -> raw.qc.data

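This will permit you to suppress results outside a certain z-score. A z-score is undefined when the group was dropped from the summary table for having fewer than 20 results (the join then returns NA), or when the SD is zero. So, let’s suppress all results with an undefined z-score and all results with an absolute z-score of 4 or more: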

raw.qc.data %>%
  drop_na(z.score) %>%
  filter(abs(z.score) < 4) -> raw.qc.data
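
To see the z-score filter in isolation, here is a small base-R sketch on made-up data: 24 well-behaved results and one gross outlier in a single hypothetical QC level. It uses ave() rather than the dplyr join above, so it is an illustration of the idea, not the code used in this post.

```r
# Hypothetical QC level: 24 results near 10 and one obvious outlier at 20
qc <- data.frame(qc.name = "L1",
                 result = c(rep(c(9.9, 10, 10.1), 8), 20))

# z-score of each result within its QC level
qc$z.score <- ave(qc$result, qc$qc.name,
                  FUN = function(x) (x - mean(x)) / sd(x))

# Drop undefined z-scores and anything with |z| >= 4
clean <- qc[!is.na(qc$z.score) & abs(qc$z.score) < 4, ]
nrow(clean)
```

One caveat worth knowing: with the sample SD, no single point can have |z| larger than (n − 1)/√n, so a cutoff of 4 can only ever trigger in groups of at least 18 results. Requiring a count of at least 20 keeps the cutoff meaningful.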

Now, we can re-run the dplyr summary:

raw.qc.data %>% dplyr::filter(!is.na(result)) %>%
  group_by(instr.code,test.code,qc.name,qc.expire) %>%
  summarise(median = median(result),
            IQR = IQR(result),
            mean = mean(result),
            SD = sd(result),
            min = min(result),
            max = max(result),
            CV = round(sd(result, na.rm = TRUE)/mean(result, na.rm = TRUE)*100,2),
            count = n()) %>%
  filter(count >= 20) %>%
  arrange(instr.code, test.code, median) -> summary.table.no.outliers

And now we have a summary of every QC CV in our Sunquest system with outliers suppressed:

head(summary.table.no.outliers)

instr.code	test.code	qc.name	qc.expire	median	IQR	mean	SD	min	max	CV	count
JBGAS	BCL	RAD3	R0141 EXP SEP 2017	65.0	1.000	65.145454	0.6503043	63.0	66.0	1.00	55
JBGAS	BCL	RAD2	R0175 EXP MAR 2021	97.0	0.000	97.128205	0.3364820	97.0	98.0	0.35	78
JBGAS	BCL	RAD1	R0173 EXP MAR 2021	122.0	0.000	122.122807	0.5691527	121.0	124.0	0.47	57
JBGAS	BGLUC	RAD1	R0173 EXP MAR 2021	1.5	0.000	1.507017	0.0257713	1.5	1.6	1.71	57
JBGAS	BGLUC	RAD2	R0175 EXP MAR 2021	5.6	0.075	5.585897	0.0639081	5.4	5.7	1.14	78
JBGAS	BGLUC	RAD3	R0141 EXP SEP 2017	13.7	0.100	13.763636	0.1310409	13.4	14.1	0.95	55

And there we have it:

SQ Screenshot1

Now I can write the output file:

write_csv(summary.table.no.outliers, "QC_summary.csv")

With dplyr, if you direct your energies to the right place, you reap much. Similarly:

“But seek ye first the kingdom of God, and his righteousness; and all these things shall be added unto you.”

Matthew 6:33

Break up with Excel: Intro and Advanced R Data Science Courses at MSACL.org

Salzburg, Austria, September 22–24, 2019

August 30, 2019August 30, 2019 dtholmes@mail.ubc.ca

MSACL Conference

There are two RStats Data Science courses happening in Salzburg, Austria on September 22–24, 2019 at the 6th annual MSACL Clinical Mass Spectrometry Conference. These courses are held twice annually, once in Europe and once in Palm Springs.

plot of chunk msaclAd

Introductory Course

The introductory course will be taught by Dan Holmes, MD of the University of British Columbia and Will Slade, PhD of Laboratory Corporation of America.
Course description is here

Intermediate/Advanced Course

The intermediate/advanced course will be taught by Shannon Haymond, PhD of Northwestern University and Patrick Mathias, MD PhD of the University of Washington.
Course description is here

Registration

Although the conference is for clinical mass spectrometry, the courses are generic in nature and generally geared towards the biological and health sciences rather than mass spectrometry per se.
You do not need to register for the conference to attend the pre-conference courses.
Academic rates apply. Registration for the course includes lots of snacks and coffee… like, good coffee.
Registration details are here.

Details

Both courses will take place in the following location:

Salzburg Congress Centre
- Day 1: September 22, 2019, 1300h–1800h
- Day 2: September 23, 2019, 0830h–1730h
- Day 3: September 24, 2019, 0830h–1130h

RMarkdown Template that Manages Academic Affiliations – docx or PDF output

August 26, 2019August 28, 2019 dtholmes@mail.ubc.ca

Background

I like writing my academic papers in RMarkdown because it allows reproducible research. The cleanest way to submit a manuscript made in RMarkdown is using the LaTeX code that it generates using the YAML switch keep_tex: true. A minimalist YAML header would look like so:

---
title: The document title
author: 
  - Duke A Caboom, MD
  - Justin d'Ottawa, PhD
output: 
  pdf_document:
    keep_tex: true
---

Introduction

However, when you want multiple author affiliations, you discover that you can’t do as you would in LaTeX because Pandoc does not know what to do with the affiliations, and you end up with a disheartening PDF that looks like the output shown in figure 1 below:

Figure 1: This is so sad.

The situation worsens if you want MS-Word output. As those of us in medical fields know, most journals (with some notable exceptions like the Clinical Mass Spectrometry Journal and other Elsevier journals like Clinical Biochemistry and Clinica Chimica Acta) require submission of a document in MS-Word format, which goes against all that Data Science and Reproducible Research stand for (he says, with hyperbole). Parenthetically, it is my hope that since AACC has indicated that they intend to make Data Science a strategic priority for Lab Medicine, they will soon accept submissions to Clinical Chemistry and Journal of Applied Laboratory Medicine written reproducibly in RMarkdown or LaTeX.

In the meantime, here are the workarounds for getting the affiliations to display correctly along with all the other stuff we want, namely, cross referencing of figures and tables, correct reference formatting and abbreviation of journal names. This allows you to avoid the horror of manually fixing your Word document after it is generated from RMarkdown. In any case, let’s start with MS-Word.

Dependencies for MS-Word and the Associated YAML

You will also need to install Pandoc which is the Swiss Army Knife of document conversion. It’s going to turn your code into a .docx file for you. Mac users can do this with Homebrew on the terminal command line:

brew install pandoc
brew install pandoc-citeproc
brew install pandoc-crossref

There are some extra installs required to help Pandoc do its job. Install the prebuilt binaries if you can.

Finally, you need to use some scripts written in the Lua scripting language which means you will need the language itself:

lua language

And you will need two Lua scripts:

These are in Pandoc github repository:

You want the files named scholarly-metadata.lua and author-info-blocks.lua.

You will need to choose a .csl file for your journal. This will tell Pandoc how to format the references. You can download the correct .csl file here. You will also need a journal abbreviations database. I have made one for you from the Web of Science list and you can download it here.

You will need to create a .bibtex database which is just your list of references. This can be exported from various reference managers or built by hand. Name the file mybibfile.bib.

Now follow the bouncing ball:

Go to the directory containing your .Rmd file.
Create a directory in it called “Extras”
Put the two Lua scripts, the Bibtex database, the abbreviations database and the .csl file into the “Extras” folder.
If you want to avoid Pandoc’s goofy default .docx formatting, then put this word document in the same folder.

Download the contents of this folder from my github repo that has everything set up as I describe above.
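
Before knitting, it can help to confirm that the “Extras” folder really contains everything the YAML header will point to. The following is a minimal base-R sketch; the helper missing_extras() is hypothetical, and the file names are simply the ones used in the YAML in this post, so adjust them to match whatever you downloaded.

```r
# Hypothetical helper: report which of the expected files are absent from a folder
missing_extras <- function(dir = "Extras") {
  expected <- c("scholarly-metadata.lua",
                "author-info-blocks.lua",
                "mybibfile.bib",
                "abbreviations.json",
                "clinical-mass-spectrometry.csl",
                "Reference_Document.docx")
  expected[!file.exists(file.path(dir, expected))]
}

# An empty character vector means everything is in place
missing_extras()
```

Run it from the directory containing your .Rmd file, since the paths in pandoc_args are relative to that directory.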

For two authors, your YAML will need to look like this:

title: |
  RMarkdown Template for Managing  
  Academic Affiliations 
subtitle: |
  Also Deals with Cross References and  
  Reference Abbreviations for MS-Word Output
author:
  - Duke A Caboom, MD:
      email: duke.a.caboom@utuktoyaktuk.edu
      institute: [UofT]
      correspondence: true
  - Justin d'Ottawa, PhD:
      email: justin@neverready.ca
      institute: [UofO]
      correspondence: false
institute:
  - UofT: University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada
  - UofO: University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
abstract: |
  **Introduction**: There's a big scientific problem out there. I know how to fix it.
  **Methods**: My experiments are pure genius.
  **Results**: Now you have your proof.
  **Conclusion**: Give me more grant money.
journal: "An awesome journal"
date: ""
toc: false
output:
  bookdown::word_document2:
    pandoc_args:
      - --csl=Extras/clinical-mass-spectrometry.csl
      - --citation-abbreviations=Extras/abbreviations.json
      - --filter=pandoc-crossref
      - --lua-filter=Extras/scholarly-metadata.lua
      - --lua-filter=Extras/author-info-blocks.lua
      - --reference-doc=Extras/Reference_Document.docx 
bibliography: "Extras/mybibfile.bib"
keywords: "CRAN, R, RMarkdown, RStudio, YAML"

Et voila! Figure 2 shows that we have something reasonable.

Figure 2: This is so great

Dependencies for LaTeX and the Associated YAML

It goes without saying that you need to install LaTeX. LaTeX markup language is available here: Mac, Windows. For Linux, just install from the command line with your package manager. Do a full install with all the glorious bloat of all LaTeX packages. This saves many headaches in the future.

You don’t need the Lua scripts for LaTeX, although you can use them. The issue with LaTeX is that the .tex template that Pandoc uses for generating LaTeX files does not support author affiliations as described in the Pandoc documentation. So what you need to do is modify the Pandoc LaTeX template. To get your current working copy of the Pandoc LaTeX template, open up a terminal (Mac/Linux) and type:

pandoc -D latex > mytemplate.tex

This will push the contents to a file. Move the file to the “Extras” folder discussed above. If that seems difficult, you can also download it here. Now you have to edit it. Open it up in a text editor and find the section that reads:

$if(author)$
\author{$for(author)$$author$$sep$ \and $endfor$}
$endif$

Replace this with the following code, which invokes the LaTeX authblk package:

$if(author)$
    \usepackage{authblk}
    $for(author)$
        $if(author.name)$
            $if(author.number)$
                \author[$author.number$]{$author.name$}
            $else$
                \author[]{$author.name$}
            $endif$
            $if(author.affiliation)$
                $if(author.email)$
                    \affil{$author.affiliation$ \thanks{$author.email$}}
                $else$
                    \affil{$author.affiliation$}
                $endif$
            $endif$
            $else$  
            \author{$author$}
        $endif$
    $endfor$
$endif$

Then make your YAML header look like this:

---
title: |
  RMarkdown Template for Managing  
  Academic Affiliations 
subtitle: |
  Also Deals with Cross References and  
  Reference Abbreviations for PDF Output
author:
- name: Duke A Caboom, MD
  affiliation: University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada
  email: dtholmes@mail.ubc.ca
  number: 1
- name: Justin d'Ottawa, PhD
  affiliation: University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
  email: justin@neverready.ca
  number: 2
abstract: |
  **Introduction**: There's a big scientific problem out there. I know how to fix it.

  **Methods**: My experiments are pure genius.

  **Results**: Now you have your proof.

  **Conclusion**: Give me more grant money.
toc: false
output: 
  bookdown::pdf_document2:
    pandoc_args:
      - --filter=pandoc-crossref
      - --csl=Extras/clinical-mass-spectrometry.csl
      - --citation-abbreviations=Extras/abbreviations.json
      - --template=Extras/mytemplate.tex
bibliography: "Extras/mybibfile.bib"
keep-latex: true

---

And as you can see in figure 3, you get a correctly formatted list of authors.

Figure 3: This is also great.

Cross Reference of a Table

Of course, tables can be cross referenced in the same manner as figures. Here is a cross reference to table 1 using the code \@ref(tab:mytable) .

Table 1: A short table
term	estimate	std.error	statistic	p.value
(Intercept)	36.908	2.191	16.847	0.000
hp	-0.019	0.015	-1.275	0.213
cyl	-2.265	0.576	-3.933	0.000
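
For reference, a table like Table 1 would normally come from a named code chunk. The sketch below is an assumption about what such a chunk might contain: the coefficients shown are consistent with a linear model of `mpg` on `hp` and `cyl` in the built-in `mtcars` data, and the label `mytable` is taken from the cross-reference code above. The object names `fit` and `coef.table` are mine. In an R Markdown document this code would sit in a chunk labelled `mytable`, so that `\@ref(tab:mytable)` resolves to it.

```r
# Assumed model: the estimates in Table 1 are consistent with this fit to mtcars
fit <- lm(mpg ~ hp + cyl, data = mtcars)

# Build a coefficient table in base R (the original chunk is not shown in the post)
coef.table <- data.frame(
  term = rownames(coef(summary(fit))),
  round(coef(summary(fit)), 3),
  row.names = NULL,
  check.names = FALSE
)
names(coef.table) <- c("term", "estimate", "std.error", "statistic", "p.value")

# kable() supplies the caption that bookdown numbers; skipped if knitr is absent
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::kable(coef.table, caption = "A short table")
}
```

The caption argument matters here: bookdown only numbers, and therefore can only cross-reference, tables that carry a caption.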

This Template also Takes Care of Reference Abbreviation.

As usual, you can make a citation with the code [@bibtexname], where bibtexname is the article’s abbreviated handle in your BibTeX database. Here is a great resource on the bookdown package [1] and reproducible research [2] and here are references where the journal title is longer [3,4]. The references in your document (and shown below) will have appropriate abbreviations based on the .json abbreviations database I have provided. In this case, I have chosen the .csl file for Clinical Mass Spectrometry–’cause MSACL.

Other Ways to Skin the YAML Cat

I came across some other ways to deal with this that I did not like as much but they are simpler. Here is one using a footnote.

title: The document title
author:
- [Duke A Caboom, MD]^(University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada)
- [Justin d'Ottawa, PhD]^(University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada)
output: pdf_document

And you can also misuse the date variable:

title: The document title
author:
- Duke A Caboom, MD [1]
- Justin d'Ottawa, PhD [2]
date: 1. University of Tuktoyaktuk, CXVG+62 Tuktoyaktuk, Inuvik, Unorganized, NT Canada \newline 2. University of Ottawa, 75 Laurier Ave E, Ottawa, ON K1N 6N5, Canada
output: pdf_document

Conclusion

This concludes my long personal struggle to get a completely reproducible .docx manuscript generated by RMarkdown and Pandoc. Here is the output for PDF and Word.

Parting Thought

Let us not become weary in doing good, for at the proper time we will reap a harvest if we do not give up.

Galatians 6:9

References

[1] Y. Xie, J.J. Allaire, G. Grolemund, R markdown: The definitive guide, Chapman; Hall/CRC, 2018. https://bookdown.org/yihui/bookdown.

[2] R.D. Peng, Reproducible research in computational science, Science. 334 (2011) 1226–1227.

[3] G. Eisenhofer, C. Durán, T. Chavakis, C.V. Cannistraci, Steroid metabolomics: Machine learning and multidimensional diagnostics for adrenal cortical tumors, hyperplasias, and related disorders, Curr. Opin. Endocr. Metab. Res. 8 (2019) 40–49. doi:https://doi.org/10.1016/j.coemr.2019.07.002.

[4] F.B. Vicente, D.C. Lin, S. Haymond, Automation of chromatographic peak review and order to result data transfer in a clinical mass spectrometry laboratory, Clin. Chim. Acta. 498 (2019) 84–89. doi:https://doi.org/10.1016/j.cca.2019.08.004.

Reproducible Research: Write your Clinical Chemistry paper using R Markdown

February 5, 2018 (updated June 26, 2018) dtholmes@mail.ubc.ca

Abstract
Background: This blog post is going to show you how to write a reproducible article in the field of clinical chemistry using R Markdown. The only thing that will change from journal to journal will be the reference formatting and perhaps section numbering. The source code itself will be provided so that you can use it as a template.

Methods: The paper will use R, R-Markdown, bookdown and pandoc. The references will be taken care of using BibTeX and reference formatting will be managed with Zotero csl files.

Results: The result will be a manuscript that anyone can reproduce.

Conclusions: R Markdown makes reproducible research through literate programming pretty easy.

1 Background

Last week at the MSACL conference Dr. Keith Baggerly from MD Anderson Cancer Centre’s Bioinformatics and Computational Biology Group spoke about the importance of reproducible research using the Duke University ovarian cancer biomarker scandal as a backdrop. The talk…was…incredible and illustrated how easy it is to introduce catastrophic errors into your research papers through the use of GUI analytical tools. The now retracted article that Baggerly dismantled is here. I urge everyone in our field to watch similar talks from Keith discussing biomarker analysis in mass spectrometric proteomic data and microarray data. Shannon Haymond and I were then discussing how to make a submission in the field of clinical chemistry that is reproducible. While this article will not discuss the basics of R and R Markdown, it will serve as a guide for those who know a little about these, give you a working YAML header, and get the citations and cross-references correct.

2 Overhead

2.1 YAML

RMarkdown articles require a YAML header to instruct R Markdown how to process your article. This is the YAML code that worked for me. I am sure there are other ways to do this:

---
title: "Reproducible Research: Write your Clinical Chemistry paper using R Markdown"
author: 
- Daniel Holmes, MD
date: "February 04, 2018"
documentclass: "article"
header-includes:
   - \usepackage{amsmath}
output:
  #bookdown::pdf_document2:
  #bookdown::word_document2:
  bookdown::html_document2:
    toc: no
    pandoc_args: [
      "--csl", "clinical-chemistry.csl" , "--citation-abbreviations", "abbreviations.json"
    ]
bibliography: bibliography.bib
abstract: |
  **Background:** Put
  
  **Methods:** Your
  
  **Results:** Abstract
  
  **Conclusions:** Here
  
keywords: "bla bla bla"
---

Indenting and spacing really matter a lot in YAML so don’t mess with them. You can generate PDF, MS Word or HTML output as needed by uncommenting and commenting out the output type as appropriate in the YAML above.

2.2 CSL Files

CSL files take care of citation formatting for you. Depending on which journal you are submitting to, you will need a different .csl file. They exist for every conceivable journal. In the world of non-reproducible reports, this process is taken care of by reference managers in GUI word processors but since GUI word processors do not produce reproducible research, we must break up with them.

From the YAML, you will see that you need a file called “clinical-chemistry.csl” which I downloaded from here. Put this file in the same folder as your R Markdown file. The Clinical Chemistry .csl depends on the .csl file of the American Association for Cancer Research but it will be downloaded for you on the fly. If you need a different reference format, search the CSL GitHub repository for the appropriate file.

2.3 BibTeX

Reference management in R Markdown is taken care of by BibTeX. You can see from the YAML that we need a bibliography text file called “bibliography.bib”. You can name it whatever you like but you will need to change the YAML accordingly. In any case, any citation you intend to make will have to be in the .bib file. I am going to cite Shannon Haymond because this article was her idea and I will toss in a couple of other references so you can see that they get cited in order as we would like.

Below is my bibliography.bib file. I put it in the same folder as my R Markdown file. You make a .bib file using a text editor (or RStudio) by cutting and pasting the BibTeX citations from Google Scholar:

@article{shannon2017,
  title={Contribution of symmetric dimethylarginine to GFR decline in pediatric chronic kidney disease},
  author={Brooks, Ellen R and Haymond, Shannon and Rademaker, Alfred and Pierce, Christopher and Helenowski, Irene and Passman, Rod and Vicente, Faye and Warady, Bradley A and Furth, Susan L and Langman, Craig B},
  journal={Pediatric Nephrology},
  pages={1--8},
  year={2017},
  publisher={Springer}
}

@article{li2017wellness,
  title={Wellness Initiatives: Benefits and Limitations},
  author={Li, Michelle and Diamandis, Eleftherios P and Paneth, Nigel and Yeo, Kiang-Teck J and Vogt, Henrik and Master, Stephen R},
  journal={Clinical Chemistry},
  volume={63},
  number={6},
  pages={1063--1068},
  year={2017},
  publisher={Clinical Chemistry}
}

@article{holmes2005preanalytical,
  title={Preanalytical influences on DPC IMMULITE 2000 intact PTH assays of plasma and serum from dialysis patients},
  author={Holmes, Daniel T and Levin, Adeera and Forer, Barry and Rosenberg, Frances},
  journal={Clinical Chemistry},
  volume={51},
  number={5},
  pages={915--917},
  year={2005},
  publisher={Clinical Chemistry}
}

2.4 Journal Abbreviations

Now, the fussiest thing I had to do was get the references abbreviating properly. We need an abbreviation database. Fortunately, I could download the abbreviation database from the Web of Science as a .csv file and then convert it to a JSON file. This was Stephen’s idea. I had to deal with a couple of badly behaving characters from some journal titles. This script, if embedded in your document, will download the .csv for you and then make the abbreviation database for you. That way your citations will say, “J Clin Pathol” and not “Journal of Clinical Pathology” etc.

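```{r, echo = FALSE}
# install RJSONIO if it is missing, then load it so that toJSON() is available
if(!requireNamespace('RJSONIO', quietly = TRUE)){install.packages('RJSONIO')}
library(RJSONIO)
if(!file.exists("abbreviations.json")){
  # download the Web of Science abbreviation table (semicolon-delimited)
  download.file("https://ndownloader.figshare.com/files/5212423","wos_abbrev_table.csv")
  abbrev <- read.csv("wos_abbrev_table.csv", sep = ";", header = TRUE, stringsAsFactors = FALSE)
  # double any backslashes in the journal titles (the badly behaving characters)
  abbrev$full <- gsub("\\", "\\\\",abbrev$full, fixed = TRUE)
  # layout: default -> container-title -> full title = abbreviation
  abbrev.list <- list('default' = list('container-title' = abbrev$abbrev.dots))
  names(abbrev.list$default$`container-title`) = abbrev$full
  write(toJSON(abbrev.list), "abbreviations.json")
  rm(abbrev)
  rm(abbrev.list)
}
```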
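
To make the target structure concrete, here is a minimal sketch with a two-row toy table standing in for the Web of Science download. The column names `full` and `abbrev.dots` follow the script above; the two journal rows and the object names `toy`, `container.title` and `toy.list` are made up for illustration, and nothing is written to disk.

```r
# toy stand-in for the downloaded abbreviation table (rows are illustrative only)
toy <- data.frame(
  full = c("Journal of Clinical Pathology", "Clinical Chemistry"),
  abbrev.dots = c("J. Clin. Pathol.", "Clin. Chem."),
  stringsAsFactors = FALSE
)

# named character vector: names are full titles, values are abbreviations
container.title <- setNames(toy$abbrev.dots, toy$full)

# nested list mirroring the layout that the script serializes to JSON
toy.list <- list(default = list(`container-title` = container.title))

# inspect the nesting, then look up one abbreviation by its full title
str(toy.list)
toy.list$default$`container-title`[["Clinical Chemistry"]]
```

Keying the abbreviations by full journal title is what lets the citation processor swap in the short form when it meets the long one in your .bib file.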

2.5 Citation Management

Now we can proceed to cite articles from our .bib file freely inserting syntax like this: [@shannon2017]. Shannon wrote this interesting article on symmetric dimethylarginine (1), Stephen wrote about wellness initiatives with Dr. Diamandis (2) and Dan wrote a paper about PTH when he was a resident (3). That makes for a total of 3 citations in this manuscript. (1–3)

3 Reporting Your Results

3.1 Figures

You can embed figures in R Markdown as local images or hyperlinks as follows:

![Grumpy Cat does not care about reproducibility](grumpy.jpg)

Grumpy Cat does not care about reproducibility

If you need to do cross-referencing of figures in your document which will change automatically for you if you insert another figure, you can do this by inserting your figure with an R code-chunk and giving the code-chunk a name.

```{r ancient-aliens, fig.height=3, fig.width=2, fig.cap="This is the ancient aliens guy.", echo = FALSE}
knitr::include_graphics('ancient_aliens.jpg')
```

Figure 3.1: This is the ancient aliens guy.

Then you can reference your code chunk using syntax like this: See figure \@ref(fig:ancient-aliens)

and you will automatically create appropriately cross-referenced figures that get automatically numbered like this: See figure 3.1. Of course, the hallmark of reproducibility is to embed R code right into the document. See, for example, figure 3.2.

```{r example-code, fig.cap = "This is a reproducible figure"}
set.seed(10)
x <- runif(100,0,100)
y <- x + rnorm(100,0,0.10)*x
plot(x,y,
main = "Reproducible Figure",
pch = 16,
col = "blue",
xlab = "Current Method (mmol/L)",
ylab = "New Method (mmol/L)")
abline(lm(y~x), col = "red", pch = 2)
```

Figure 3.2: This is a reproducible figure

3.2 Inline Calculations

When you are reporting your amazing results you can have inline code calculations by syntax that looks like this: `r round(median(x),1)` and would result in the median value of $x$ being reported as 43.9 mmol/L.

3.3 Tables

Tables are not a problem either and can be made with the kable() function of the knitr package or with the xtable package. Tables can be cross-referenced analogously to figures. See table 3.1.

```{r example-table, echo = FALSE, results = 'asis'}
library(knitr)
a <- 1:5
b <- 2:6
c <- a*b
z <- data.frame(a,b,c)
kable(z,
  caption = "This is a great table", 
  col.names = c("First","Second", "Third"))
```

Table 3.1: This is a great table
First	Second	Third
1	2	2
2	3	6
3	4	12
4	5	20
5	6	30

3.4 Math

Math works pretty magically using $\LaTeX$ syntax. For example inline math can be done like so $\sin^2x + \cos^2x = 1$ . This will result in: $\sin^2x + \cos^2x = 1$. And math can also be done as a code block like so:

$$
\oint_S {E_n dA = \frac{1}{{\varepsilon _0 }}} Q_{inside}
$$

Gauss’ Law says: \[
\oint_S {E_n dA = \frac{1}{{\varepsilon _0 }}} Q_{inside}
\]

Equations can be cross-referenced just like tables and figures.
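
As a sketch of what that looks like in bookdown: the equation goes in an `equation` environment carrying a `(\#eq:label)` tag, and `\@ref(eq:label)` then refers to it from the text. The label `gauss` below is just an illustrative name.

```latex
\begin{equation}
\oint_S E_n \, dA = \frac{1}{\varepsilon_0} Q_{inside}
(\#eq:gauss)
\end{equation}

See equation \@ref(eq:gauss).
```

As with figures and tables, the number is assigned at render time, so inserting another labelled equation earlier in the document renumbers everything for you.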

4 Conclusion

I hope this makes writing a reproducible paper easier for you. A minimal template to produce output in PDF is here. And the PDF output itself is here. You’ll need $\LaTeX$ installed of course.

Parting Thought

You can cite books too and get Greek letters too:

“Very truly I tell you,” Jesus answered, “before Abraham was born, $\varepsilon\gamma\omega$ $\varepsilon\iota\mu\iota$ (I Am)!” (4)

References

1. Brooks ER, Haymond S, Rademaker A, Pierce C, Helenowski I, Passman R, et al. Contribution of symmetric dimethylarginine to gfr decline in pediatric chronic kidney disease. Pediatr Nephrol. Springer; 2017;1–8.

2. Li M, Diamandis EP, Paneth N, Yeo K-TJ, Vogt H, Master SR. Wellness initiatives: Benefits and limitations. Clin Chem. Clinical Chemistry; 2017;63:1063–8.

3. Holmes DT, Levin A, Forer B, Rosenberg F. Preanalytical influences on DPC IMMULITE 2000 intact PTH assays of plasma and serum from dialysis patients. Clin Chem. Clinical Chemistry; 2005;51:915–7.

4. John the A, Chist J, Spirit H, God F. The Gospel According to John, 8:58. 1 Heavenly Way: Hosannah Press; 80AD.

Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

August 28, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches to distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eye-ball approach to where the peak starts and ends and is represented by the vertical lines and the enclosed shaded area.

plot of chunk unnamed-chunk-1

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

plot of chunk unnamed-chunk-2

During the discussion it was pointed out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two described procedures. By this I mean integration as follows:

plot of chunk unnamed-chunk-3

There was discussion that this procedure is challenging for a number of reasons. Further, it should be noted that a deconvolution approach is only likely to have clinical value when the concentration of the monoclonal protein is low enough that manual integration will show poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy. Fitting tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

plot of chunk unnamed-chunk-4

with a gamma region looking like this:

plot of chunk unnamed-chunk-5

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

plot of chunk unnamed-chunk-6

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal $\gamma$-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

plot of chunk unnamed-chunk-7

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The data has 10284 pairs of (x,y) data. But if you zoom in and look carefully you find that it is a series of staircases.

plot(y~x, data = head(ep.data,100), type = "o", cex = 0.5)

plot of chunk unnamed-chunk-8

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible–though this probably does not matter too much.

ep.rle <- rle(ep.data$y)
stair.midpoints <- cumsum(ep.rle$lengths) - floor(ep.rle$lengths/2)
ep.data.sm <- ep.data[stair.midpoints,]
plot(y~x, data = head(ep.data,300), type = "o", cex = 0.5)
points(y~x, data = head(ep.data.sm,300), type = "o", cex = 0.5, col = "red")

plot of chunk unnamed-chunk-9
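
To see what the midpoint arithmetic does, here is the same calculation on a tiny made-up staircase. The vector `toy.y` and the other `toy.*` names are illustrative only and are not part of the SPEP data.

```r
# toy staircase: three runs of lengths 3, 2 and 4 (values are made up)
toy.y <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
toy.rle <- rle(toy.y)

# last index of each run, stepped back by half of the run length
toy.midpoints <- cumsum(toy.rle$lengths) - floor(toy.rle$lengths / 2)
toy.midpoints

# one point survives per step
toy.y[toy.midpoints]
```

The runs end at positions 3, 5 and 9, and stepping back by 1, 1 and 2 lands on positions 2, 4 and 7, one point at or just past the middle of each step.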

Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

ep.data <- ep.data.sm

Transformation

The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.

library(Bolstad) #A package containing a function for Simpson's Rule integration
ep.data$x <- ep.data$x/max(ep.data$x)
A.tot <- sintegral(ep.data$x,ep.data$y)$value
ep.data$y <- ep.data$y/A.tot

#sanity check
sintegral(ep.data$x,ep.data$y)$value

## [1] 1

plot(y~x, data = ep.data, type = "l")

plot of chunk unnamed-chunk-11
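
The same normalization can be sketched without the Bolstad package. Note that this swaps in the trapezoidal rule for Simpson's Rule, so it is not the calculation used above, and the curve in `toy` is synthetic rather than real SPEP data. The helper name `trapz` is my own.

```r
# trapezoidal-rule area under y(x); a stand-in for sintegral(), not the same method
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# synthetic trace in arbitrary "page coordinate" units
toy <- data.frame(x = seq(0, 500, by = 1))
toy$y <- 40 * exp(-(toy$x - 200)^2 / (2 * 30^2))

# rescale x so that its maximum is 1, then rescale y so that the total area is 1
toy$x <- toy$x / max(toy$x)
toy$y <- toy$y / trapz(toy$x, toy$y)

# sanity check: should be 1 up to floating-point error
trapz(toy$x, toy$y)
```

The order matters: the area has to be computed after the x-axis is rescaled, otherwise the final integral will be off by the x scaling factor.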

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:

library(quantmod)
ep.max <- findPeaks(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Maxima")
abline(v = ep.data$x[ep.max], col = "red", lty = 2)

plot of chunk unnamed-chunk-12

ep.min <- findValleys(ep.data$y)
plot(y~x, data = ep.data, type = "l", main = "Minima")
abline(v = ep.data$x[ep.min], col = "blue", lty = 2)

plot of chunk unnamed-chunk-12

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

ep.max <- ep.max[-1]
ep.min <- ep.min[-c(1,length(ep.min))]
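
For readers who want to see the idea behind an extrema search, it can be sketched in base R from sign changes of the first difference. This is my own minimal version and not quantmod's implementation, so its index conventions may differ from `findPeaks` and `findValleys`; the names `local.max`, `local.min` and `toy.y` are illustrative.

```r
# indices of interior local maxima and minima, from sign changes of the first difference
local.max <- function(y) which(diff(sign(diff(y))) < 0) + 1
local.min <- function(y) which(diff(sign(diff(y))) > 0) + 1

# made-up trace with peaks at positions 3 and 7 and a valley at position 5
toy.y <- c(0, 1, 3, 1, 0, 2, 5, 2, 0)
local.max(toy.y)
local.min(toy.y)
```

A flat run at the top of a peak is reported more than once by this approach, which is one more reason the staircase data needed smoothing first.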

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of $\mu_i$ and $\sigma_i$) but there is no particular clinical value to fitting the albumin, $\alpha_1$, $\alpha_2$, $\beta_1$ and $\beta_2$ domains with Gaussians. What is of interest is separately quantifying the two peaks in $\gamma$ with two separate Gaussians so let's isolate the $\gamma$ region based on the location of the minimum between $\beta_2$ and $\gamma$.

Isolate the $\gamma$ Region

gamma.ind <- max(ep.min):nrow(ep.data)
gamma.data <- data.frame(x = ep.data$x[gamma.ind], y = ep.data$y[gamma.ind])
plot(y ~ x, gamma.data, type  = "l")

plot of chunk unnamed-chunk-14

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

where $\mu_1$ is the $x$ location of the first peak in $\gamma$ and $\mu_2$ is the $x$ location of the second peak in $\gamma$. The estimates of $\sigma_1$ and $\sigma_2$ can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to $\sigma$ by

\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]
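
The factor follows from setting a single Gaussian equal to half of its peak height. Writing $x_{1/2}$ (my notation, not used elsewhere in this post) for a point where peak $i$ has fallen to half of $C_i$:

```latex
\[
\begin{aligned}
C_i \exp\Big(-\frac{(x_{1/2}-\mu_i)^2}{2\sigma_i^2}\Big) &= \frac{C_i}{2} \\
\frac{(x_{1/2}-\mu_i)^2}{2\sigma_i^2} &= \ln 2 \\
x_{1/2} &= \mu_i \pm \sigma_i\sqrt{2\ln 2} \\
FWHM_i &= 2\sqrt{2\ln 2} \times \sigma_i \approx 2.355 \times \sigma_i
\end{aligned}
\]
```

The two solutions sit symmetrically on either side of $\mu_i$, so the full width is twice the half-width $\sigma_i\sqrt{2\ln 2}$. This holds for an isolated Gaussian; overlapping peaks distort the measured half-widths.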

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the $FWHM$. Because the peaks are poorly resolved, it also tries to get the smallest possible estimate returning this as FWHM2.

FWHM.finder <- function(ep.data, mu.index){
  peak.height <- ep.data$y[mu.index]
  #the curve crosses half-maximum wherever this changes sign
  fxn.for.roots <- ep.data$y - peak.height/2
  root.indices <- which(diff(sign(fxn.for.roots))!=0)
  #find where the peak sits among the crossings (base sort() avoids needing %>%)
  tmp <- sort(c(root.indices,mu.index))
  tmp2 <- which(tmp == mu.index)
  #crossings immediately to the left and right of the peak
  first.root <- root.indices[tmp2 -1]
  second.root <- root.indices[tmp2]
  HWHM1 <- ep.data$x[mu.index] - ep.data$x[first.root]
  HWHM2 <- ep.data$x[second.root] - ep.data$x[mu.index]
  FWHM <- HWHM2 + HWHM1
  #smallest estimate: twice the narrower half-width, for poorly resolved peaks
  FWHM2 <- 2*min(c(HWHM1,HWHM2))
  return(list(HWHM1 = HWHM1,HWHM2 = HWHM2,FWHM = FWHM,FWHM2 = FWHM2))
}
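
As a quick sanity check of the idea, independent of the real data, the half-maximum crossings of a synthetic Gaussian should recover its $\sigma$. This is a stripped-down sketch of the same crossing logic rather than a call to `FWHM.finder` itself, and `toy`, `true.sigma`, `crossings`, `left`, `right` and `fwhm` are illustrative names.

```r
# synthetic, well-resolved Gaussian peak with a known sigma (values are made up)
true.sigma <- 0.05
toy <- data.frame(x = seq(0, 1, by = 0.001))
toy$y <- exp(-(toy$x - 0.5)^2 / (2 * true.sigma^2))

# half-maximum crossings: sign changes of y minus half of the peak height
mu.index <- which.max(toy$y)
crossings <- which(diff(sign(toy$y - toy$y[mu.index] / 2)) != 0)

# the nearest crossing on each side of the peak gives the full width at half maximum
left <- max(crossings[crossings < mu.index])
right <- min(crossings[crossings >= mu.index])
fwhm <- toy$x[right] - toy$x[left]

# convert to a sigma estimate; expected to be close to true.sigma
fwhm / (2 * sqrt(2 * log(2)))
```

The estimate is limited by the grid spacing, since each crossing is only located to the nearest sampled point.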

The peak in the $\gamma$ region was obtained previously:

plot(y ~ x, gamma.data, type  = "l")
gamma.max <- findPeaks(gamma.data$y)
abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-16

and from them $\mu_1$ is determined to be 0.7. We have to guess where the second peak is, which is at about $x=0.75$ and has an index of 252 in the gamma.data dataframe.

gamma.data[252,]

##             x         y
## 252 0.7487757 0.6381026

#append the second peak
gamma.max <- c(gamma.max,252)
gamma.mu <- gamma.data$x[gamma.max]
gamma.mu

## [1] 0.6983350 0.7487757

plot(y ~ x, gamma.data, type  = "l")
abline(v = gamma.data$x[gamma.max])

plot of chunk unnamed-chunk-17

Now we can find the estimates of the standard deviations:

#find the FWHM estimates of sigma_1 and sigma_2:
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.data)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

The estimates of $\sigma_1$ and $\sigma_2$ are now obtained. The estimates of $C_1$ and $C_2$ are just the peak heights.

peak.heights <- gamma.data$y[gamma.max]

We can now use nls() to determine the fit.

fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.data,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")
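
If you want to try the same two-Gaussian model without the SPEP data, here is a minimal sketch on synthetic data. Everything in it (the "true" parameters, the noise level, the starting values and the names demo.data and fit.demo) is an assumption made up for illustration. A little noise is added on purpose because nls() is not meant for zero-residual data.

```r
set.seed(1)  # reproducible synthetic noise

# Hypothetical "true" parameters, chosen only for illustration
x <- seq(0, 1, by = 0.002)
y <- 0.7 * exp(-(x - 0.45)^2 / (2 * 0.10^2)) +
     0.3 * exp(-(x - 0.70)^2 / (2 * 0.03^2)) +
     rnorm(length(x), sd = 0.01)
demo.data <- data.frame(x = x, y = y)

# Same model form as above, started from rough guesses near the peaks
fit.demo <- nls(y ~ C1 * exp(-(x - mean1)^2 / (2 * sigma1^2)) +
                    C2 * exp(-(x - mean2)^2 / (2 * sigma2^2)),
                data = demo.data,
                start = list(mean1 = 0.40, mean2 = 0.72,
                             sigma1 = 0.08, sigma2 = 0.04,
                             C1 = 0.6, C2 = 0.4),
                algorithm = "port")
coef(fit.demo)  # fitted parameters, to compare with the values used to simulate
```

On clean, well-separated synthetic peaks like these the fit behaves. Real scans are a different story.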

Determining the fitted values of our unknown coefficients:

dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)

fit.sum <- summary(fit)
fit.sum #show the fitted coefficients

## 
## Formula: y ~ (C1 * exp(-(x - mean1)^2/(2 * sigma1^2)) + C2 * exp(-(x - 
##     mean2)^2/(2 * sigma2^2)))
## 
## Parameters:
##         Estimate Std. Error t value Pr(>|t|)    
## mean1  0.7094793  0.0003312 2142.23   <2e-16 ***
## mean2  0.7813900  0.0007213 1083.24   <2e-16 ***
## sigma1 0.0731113  0.0002382  306.94   <2e-16 ***
## sigma2 0.0250850  0.0011115   22.57   <2e-16 ***
## C1     0.6983921  0.0018462  378.29   <2e-16 ***
## C2     0.0819704  0.0032625   25.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01291 on 611 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

And now we can plot the fitted results against the original results:

#original
plot(y ~ x, data = gamma.data, type = "l", main = "This is Garbage") 
#overall fit
lines(y ~ x, data = dffit, col ="red", cex = 0.2) 
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
#components of the fit
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

plot of chunk unnamed-chunk-22

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the $\gamma$ background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of $\mu_1$ and $\mu_2$ but this also yielded similar nonsensical fitting. So, with a lot of messing around trying different functions like the lognormal distribution, the Bi-Gaussian distribution and the Exponentially Modified Gaussian distribution, and applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the difficulty in obtaining a reasonable fit was related to the sloping baseline, so I thought I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the $\gamma$ region to be those values having $y \leq 0.02$. Then I will connect the first (x,y) point of the $\gamma$ region to the tail with a straight line.

gamma.tail <- filter(gamma.data, y <= 0.02) 
baseline.data <- rbind(gamma.data[1,],gamma.tail)
names(baseline.data) <- c("x","y")
baseline.fun <- approxfun(baseline.data)
plot(y~x, data = gamma.data, type = "l")
lines(baseline.data$x,baseline.fun(baseline.data$x), col = "blue")
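
To see what approxfun() is doing here, the following is a minimal sketch on a made-up scan: a single Gaussian peak sitting on a linear slope. The data and the names scan, anchors and baseline.demo are hypothetical, and I use base R subsetting instead of filter() so that it runs on its own.

```r
# Synthetic scan: one Gaussian peak on top of a sloping linear baseline (made-up values)
x <- seq(0, 1, by = 0.01)
peak <- 0.5 * exp(-(x - 0.4)^2 / (2 * 0.05^2))
scan <- data.frame(x = x, y = peak + 0.1 * (1 - x))

# Anchor points for the baseline: the first point plus the low tail (y <= 0.02)
tail.pts <- scan[scan$y <= 0.02, ]
anchors <- rbind(scan[1, ], tail.pts)

# approxfun() returns a function that linearly interpolates between the anchors
baseline.demo <- approxfun(anchors$x, anchors$y)

# Subtracting the interpolated baseline leaves (approximately) the peak alone
corrected <- scan$y - baseline.demo(scan$x)
max(abs(corrected - peak))  # should be essentially zero for this idealized case
```

This only works so neatly because the synthetic baseline really is a straight line.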

plot of chunk unnamed-chunk-24

Now we can define a new dataframe gamma.no.base that has the baseline removed:

gamma.no.base <- data.frame(x = gamma.data$x, y = gamma.data$y - baseline.fun(gamma.data$x))
plot(y~x, data = gamma.data, type = "l")
lines(y ~ x, data = gamma.no.base, lty = 2)
gamma.max <- findPeaks(gamma.no.base$y)[1:2] #rejects a number of extraneous peaks
abline(v = gamma.no.base$x[gamma.max])

plot of chunk unnamed-chunk-25

The black is the original $\gamma$ and the dashed has the baseline removed. This becomes an easy fit.

#Estimate the Ci
peak.heights <- gamma.no.base$y[gamma.max]
#Estimate the mu_i
gamma.mu <- gamma.no.base$x[gamma.max] #the same values as before
#Estimate the sigma_i from the FWHM
FWHM <- lapply(gamma.max, FWHM.finder, ep.data = gamma.no.base)
gamma.sigma <- unlist(sapply(FWHM, '[', 'FWHM2'))/2.355

#Perform the fit
fit <- nls(y ~ (C1*exp(-(x-mean1)**2/(2 * sigma1**2)) +
                  C2*exp(-(x-mean2)**2/(2 * sigma2**2))),
           data = gamma.no.base,
           start = list(mean1 = gamma.mu[1],
                        mean2 = gamma.mu[2],
                        sigma1 = gamma.sigma[1],
                        sigma2 = gamma.sigma[2],
                        C1 = peak.heights[1],
                        C2 = peak.heights[2]),
           algorithm = "port")

#Plot the fit
dffit <- data.frame(x=seq(0, 1 , 0.001))
dffit$y <- predict(fit, newdata=dffit)
fit.sum <- summary(fit)
coef.fit <- fit.sum$coefficients[,1]
mu.fit <- coef.fit[1:2]
sigma.fit <- coef.fit[3:4]
C.fit <- coef.fit[5:6]

plot(y ~ x, data = gamma.no.base, type = "l")
legend("topright", lty = c(1,1,1), col = c("black", "green", "blue","red"), c("Scan", "Monoclonal", "Gamma", "Sum"))
lines(y ~ x, data = dffit, col ="red", cex = 0.2)
for(i in 1:2){
  x <- dffit$x
  y <- C.fit[i] *exp(-(x-mu.fit[i])**2/(2 * sigma.fit[i]**2))
  lines(x,y, col = i + 2)
}

plot of chunk unnamed-chunk-26

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the $\gamma$ background and the red is their sum, that is, the overall fit. A better fit could now be sought with weighting or with a more flexible distribution shape. In any case, the area of the green peak is now easily determined, since

\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma_1 C_1\]

#parentheses are needed so that unname() applies to the whole product, not just C.fit[1]
A.mono <- (sqrt(2*pi)*sigma.fit[1]*C.fit[1]) %>% unname()
A.mono <- round(A.mono,3)
A.mono

## [1] 0.024

So this peak is 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and all belongs to normal $\gamma$-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the $\gamma$-region, you'd have to set up your model accordingly.
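
If you don't trust the closed-form area $\sqrt{2\pi}\sigma_1 C_1$, it is easy to check numerically with integrate(). The parameter values below are hypothetical, not the fitted ones. I integrate over $\mu_1 \pm 10\sigma_1$ rather than the whole real line because integrate() can miss a narrow peak on an infinite range.

```r
# Hypothetical Gaussian parameters (not the fitted values)
C1 <- 0.5
mu1 <- 0.7
sigma1 <- 0.02

gauss <- function(x) C1 * exp(-(x - mu1)^2 / (2 * sigma1^2))

# Numerical area over a window wide enough to hold essentially all of the peak
area.numeric <- integrate(gauss,
                          lower = mu1 - 10 * sigma1,
                          upper = mu1 + 10 * sigma1)$value

# Closed-form area
area.closed <- sqrt(2 * pi) * sigma1 * C1

c(numeric = area.numeric, closed = area.closed)
```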

All about that Base(line)

There are obviously better ways to model the baseline because this approach of a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast $\gamma$ dwarfed by normal $\gamma$-globulins. That is, like this:

plot of chunk unnamed-chunk-28

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions

The data in the PDFs generated by EP software are processed (probably with splining or similar) followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.
- This is particularly important if you are using nls() because nls() does not play nice with data having no noise (“Do not use nls on artificial 'zero-residual' data”)
Integrating monoclonal peaks under the $\gamma$ baseline (or $\beta$) is unlikely to be a one-size-fits-all approach and may require application of a number of strategies to get meaningful results.
- Baseline removal might be helpful at times.
Peak integration will require human adjudication.
While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.
MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10

Compare Tube Types with R – Repeated Measures ANOVA

August 21, 2017February 23, 2019 dtholmes@mail.ubc.ca

Background

Sometimes we might want to compare three or four tube types for a particular analyte on a group of patients or we might want to see if a particular analyte is stable over time in aliquoted samples. In these experiments we are essentially doing the multivariable analogue of the paired t-test. In the tube-type experiment, the factor that is differing between the (‘paired’) groups is the container: serum separator tubes (SST), EDTA plasma tubes, plasma separator tubes (PST) etc. In a stability experiment, the factor that is differing is storage duration.

Since this is a fairly common clinical lab experiment, I thought I would just jot down how this is accomplished in R – though I must confess I know just about $\lim_{x\to0}x$ about statistics. In any case, the statistical test is a repeated-measures ANOVA and this is one way to do it (there are many) including an approach to the post-hoc testing.

Some Fake Data to Work With

I’m going to make some fake data. I tried to dig up the data from an experiment I did as a resident but alas, I think the raw data died on an old laptop. But fake data will do for demonstration purposes. Let’s suppose we are looking at parathyroid hormone (PTH) in three different blood collection tubes: SST, EDTA and PST. For the sake of argument, let’s say that we collect samples from 20 patients simultaneously and we analyze them all as per our usual process. This means that each patient has three samples of material that should be otherwise identical outside of the effects of the collection container.

library(magrittr)
set.seed(100) #to force the same pseudo-random each time
#data in pmol/L
#induce some heteroscedastic error
SST <- runif(20,3,50)  
PST <- 1.03*SST + rnorm(20,0,0.1)*SST #set the data up to show no difference
EDTA <- 1.15*SST + rnorm(20,0,0.1)*SST  #set the data up to show a difference
tube.data <- data.frame(SST,PST,EDTA) %>% round(.,1)
tube.data <- data.frame(Subject = factor(1:20), tube.data)

This is the way we usually express (and receive) data like this in an Excel spreadsheet:

Subject	SST	PST	EDTA
1	17.5	18.1	19.9
2	15.1	15.7	20.0
3	29.0	29.2	32.9
4	5.7	6.2	6.4
5	25.0	26.1	27.0
6	25.7	26.4	29.0
7	41.2	40.8	48.1
8	20.4	22.1	24.3
9	28.7	26.9	36.0
10	11.0	13.9	13.7
11	32.4	31.9	36.9
12	44.5	49.2	57.4
13	16.2	17.1	15.7
14	21.7	24.1	26.3
15	38.8	36.8	42.6
16	34.4	34.0	44.2
17	12.6	12.1	14.1
18	19.8	20.9	25.4
19	19.9	18.2	23.0
20	35.4	37.4	34.1

This Excel-ish way of storing the data is referred to as the “datawide” format for obvious reasons.

Gather the Grain

As it turns out this is not the way that we want to store data to do the statistical analyses of interest. What we want to do is have the tube type in a single column because this is the factor that is different within the subjects. We want to gather() or melt() the data (depending on your package of choice) to be like so:

library(tidyr)
library(knitr) #for kable()
tube.data.2 <- gather(tube.data, key = "Subject")
tube.data.2 %>% kable()

Subject	Subject	value
1	SST	17.5
2	SST	15.1
3	SST	29.0
4	SST	5.7
5	SST	25.0
6	SST	25.7
7	SST	41.2
8	SST	20.4
9	SST	28.7
10	SST	11.0
11	SST	32.4
12	SST	44.5
13	SST	16.2
14	SST	21.7
15	SST	38.8
16	SST	34.4
17	SST	12.6
18	SST	19.8
19	SST	19.9
20	SST	35.4
1	PST	18.1
2	PST	15.7
3	PST	29.2
4	PST	6.2
5	PST	26.1
6	PST	26.4
7	PST	40.8
8	PST	22.1
9	PST	26.9
10	PST	13.9
11	PST	31.9
12	PST	49.2
13	PST	17.1
14	PST	24.1
15	PST	36.8
16	PST	34.0
17	PST	12.1
18	PST	20.9
19	PST	18.2
20	PST	37.4
1	EDTA	19.9
2	EDTA	20.0
3	EDTA	32.9
4	EDTA	6.4
5	EDTA	27.0
6	EDTA	29.0
7	EDTA	48.1
8	EDTA	24.3
9	EDTA	36.0
10	EDTA	13.7
11	EDTA	36.9
12	EDTA	57.4
13	EDTA	15.7
14	EDTA	26.3
15	EDTA	42.6
16	EDTA	44.2
17	EDTA	14.1
18	EDTA	25.4
19	EDTA	23.0
20	EDTA	34.1

Now we see that there is a column for tube-type and a column for the PTH results which we can name accordingly. You can see why this is called the “datalong” format.

names(tube.data.2) <- c("Subject", "Tube.Type", "PTH")
tube.data.2$Tube.Type <- as.factor(tube.data.2$Tube.Type) #turns tube type into factor
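
For what it's worth, newer versions of tidyr (1.0.0 and later) offer pivot_longer(), which does the reshaping and the naming in one step. Here is a minimal sketch on a tiny made-up wide table. The numbers and the names wide.demo and long.demo are hypothetical. Note that pivot_longer() orders the rows subject by subject rather than tube by tube, which makes no difference to the analysis.

```r
library(tidyr)

# Tiny made-up wide table with the same layout as tube.data
wide.demo <- data.frame(Subject = factor(1:3),
                        SST  = c(10.1, 20.2, 30.3),
                        PST  = c(10.4, 20.9, 31.0),
                        EDTA = c(11.5, 23.1, 34.8))

# Gather every column except Subject into a tube-type column and a result column
long.demo <- pivot_longer(wide.demo,
                          cols = -Subject,
                          names_to = "Tube.Type",
                          values_to = "PTH")
long.demo$Tube.Type <- factor(long.demo$Tube.Type)  # tube type as a factor
long.demo
```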

Visualize

Summarize the data:

summary(tube.data)

##     Subject        SST             PST             EDTA      
##  1      : 1   Min.   : 5.70   Min.   : 6.20   Min.   : 6.40  
##  2      : 1   1st Qu.:17.18   1st Qu.:17.85   1st Qu.:19.98  
##  3      : 1   Median :23.35   Median :25.10   Median :26.65  
##  4      : 1   Mean   :24.75   Mean   :25.36   Mean   :28.85  
##  5      : 1   3rd Qu.:32.90   3rd Qu.:32.42   3rd Qu.:36.23  
##  6      : 1   Max.   :44.50   Max.   :49.20   Max.   :57.40  
##  (Other):14

Let’s just have a quick look graphically:

library(mcr)
plot(mcreg(SST, EDTA,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "EDTA"))

plot of chunk unnamed-chunk-6

plot(mcreg(SST, PST,
           method.reg = "PaBa",
           mref.name = "SST",
           mtest.name = "PST"))

plot of chunk unnamed-chunk-6

And as a boxplot with the points overtop:

boxplot(PTH ~ Tube.Type,
        data = tube.data.2,
        col = c("purple", "lightgreen", "gold"))
stripchart(PTH ~ Tube.Type,
           vertical = TRUE,
           data = tube.data.2, 
           method = "jitter",
           add = TRUE,
           pch = 20,
           col = rgb(0,0,0,0.5))

plot of chunk unnamed-chunk-7

Separate the Wheat from the Chaff

Now we want to make comparisons to see if these are different. To accomplish this, we will use the aov() function. This requires us to have data formatted “datalong” as it is in the tube.data.2 dataframe.

fit <- aov(PTH ~ Tube.Type + Error(Subject/Tube.Type), data=tube.data.2)

If you are like me, this syntax is confusing. But it goes like this. PTH is a function of Tube.Type, which is straightforward, hence the PTH ~ Tube.Type bit. The error term has the Subject in front of the / and the factor that is different within the subjects (Tube.Type) after the /. That’s my grade 2 explanation from reading this and this and this.

summary(fit)

## 
## Error: Subject
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 19   7307   384.6               
## 
## Error: Subject:Tube.Type
##           Df Sum Sq Mean Sq F value   Pr(>F)    
## Tube.Type  2  195.9   97.97   22.47 3.63e-07 ***
## Residuals 38  165.7    4.36                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that there is a difference between the groups but it does not specify where the difference is.

I can’t see the difference. Can you see the difference?

Sorry – I just had to make a pop-culture reference to this. We want to be specific about where the differences are without making a Type I error which might arise if we blindly charge ahead and do multiple paired t-tests. One easy way to accomplish this is to use the pairwise.t.test() function which does corrections for multiple comparisons. You can choose from a number of approaches for adjustment for pairwise comparison. This requires the “response vector” which is PTH and the “grouping factor” which is the tube type.

# choices for p.adjust.method are: c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none")
pwt <- pairwise.t.test(tube.data.2$PTH, tube.data.2$Tube.Type, p.adj = "bonferroni", paired = TRUE)
pwt

## 
##  Pairwise comparisons using paired t tests 
## 
## data:  tube.data.2$PTH and tube.data.2$Tube.Type 
## 
##     EDTA    PST    
## PST 0.00083 -      
## SST 7.9e-05 0.35033
## 
## P value adjustment method: bonferroni

This is pretty easy to understand. There are statistically significant differences found between the EDTA and PST (p = 0.00083) and the EDTA and SST (p = 0.00008) but none between SST and PST (p = 0.35033).
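
To demystify what the Bonferroni adjustment is doing, here is a small sketch showing that an entry of the pairwise table is just the ordinary paired t-test p-value multiplied by the number of comparisons (3 here) and capped at 1. It regenerates fake data with the same recipe as above but without rounding, so the p-values will not exactly match the table. The names long, pwt.demo, p.raw and p.bonf are made up for this example.

```r
set.seed(100)
# Same recipe as the fake data above, but not rounded
SST  <- runif(20, 3, 50)
PST  <- 1.03 * SST + rnorm(20, 0, 0.1) * SST
EDTA <- 1.15 * SST + rnorm(20, 0, 0.1) * SST
long <- data.frame(PTH = c(SST, PST, EDTA),
                   Tube.Type = factor(rep(c("SST", "PST", "EDTA"), each = 20)))

# Pairwise paired t-tests with Bonferroni adjustment
pwt.demo <- pairwise.t.test(long$PTH, long$Tube.Type,
                            p.adjust.method = "bonferroni", paired = TRUE)

# The same EDTA vs SST comparison by hand: raw paired p-value times 3, capped at 1
p.raw  <- t.test(EDTA, SST, paired = TRUE)$p.value
p.bonf <- min(1, 3 * p.raw)

c(table = pwt.demo$p.value["SST", "EDTA"], by.hand = p.bonf)
```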

Conclusion

This has been a non-statistician’s approach to tube-type comparisons, which is also applicable to analyte stability studies. It is a one-way repeated measures ANOVA with one within-subjects factor. There is a great deal more to say on the matter, by people who know much more, in the citations in the links provided above.

God probably uses datawide format

All the nations will be gathered before him, and he will separate the people one from another as a shepherd separates the sheep from the goats. He will put the sheep on his right and the goats on his left.

(Matt 25:32-33)

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database

August 14, 2017August 14, 2017 dtholmes@mail.ubc.ca

Background

From time to time I have wanted to bring an online table into an R dataframe. While in principle the data can be cut and pasted into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and then write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

library(rvest)
library(httr)
wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("LabRtorian"))

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

#define empty table to hold all the content
biotable = data.frame(matrix(NA,0, 8))

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:

/html/body/div[1]/div[3]/div/main/article/div/table[1]

/html/body/div[1]/div[3]/div/main/article/div/table[2]

/html/body/div[1]/div[3]/div/main/article/div/table[3]

…

etc.

There are 8 such tables in the whole webpage. We can define a character vector for these as such:

xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

Now we make a loop to scrape the 8 tables and, with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not happen to always have a uniform number of columns.

for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j]) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}
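
A note for anyone running this with a newer rvest: as far as I know, from rvest 1.0 onward html_session() was renamed session(), html_nodes() was renamed html_elements(), and html_table() fills ragged rows without needing fill = TRUE. Here is a minimal offline sketch of the same scrape-and-stack idea under those assumptions. The inline HTML and the "Analyte A/B/C" values are made up, and it does not touch the Westgard site.

```r
library(rvest)

# Made-up page with two small tables that share the same header row
page <- minimal_html('
  <table>
    <tr><th>Sample</th><th>Analyte</th><th>CVI</th></tr>
    <tr><td>S-</td><td>Analyte A</td><td>5.0</td></tr>
    <tr><td>U-</td><td>Analyte B</td><td>12.5</td></tr>
  </table>
  <table>
    <tr><th>Sample</th><th>Analyte</th><th>CVI</th></tr>
    <tr><td>S-</td><td>Analyte C</td><td>7.5</td></tr>
  </table>')

# Select every table by xpath and parse each one into a data frame
tables <- page %>%
  html_elements(xpath = "//table") %>%
  html_table()

# Stack the parsed tables and strip the trailing minus sign from the specimen type
combined <- do.call(rbind, tables)
combined$Sample <- gsub("-", "", combined$Sample, fixed = TRUE)
combined
```

Because the header cells here are proper th elements, html_table() uses them as column names, so there are no stray header rows to remove afterwards.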

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1	X2	X3	X4	X5	X6	X7	X8
	Analyte	Number of Papers	Biological Variation	Biological Variation	Desirable specification	Desirable specification	Desirable specification
	Analyte	Number of Papers	CVI	CVg	I(%)	B(%)	TE(%)
S-	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S-	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U-	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S-	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U-	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S-	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S-	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S-	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2

We can see that we need to define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

Let's now find rows we don't want and remove them.

for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

You will find that the table has missing data, which is written as “---”. This should now be replaced by NA, and the row names should be reset to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

biotable[biotable == "---"] <- NA
row.names(biotable) <- 1:nrow(biotable)
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

Check it Out

Just having another look at the first 10 rows:

Sample	Analyte	NumPapers	CVI	CVG	I	B	TE
S	11-Desoxycortisol	2	21.3	31.5	10.7	9.5	27.1
S	17-Hydroxyprogesterone	2	19.6	50.4	9.8	13.5	29.7
U	4-hydroxy-3-methoximandelate (VMA)	1	22.2	47.0	11.1	13.0	31.3
S	5' Nucleotidase	2	23.2	19.9	11.6	7.6	26.8
U	5'-Hydroxyindolacetate, concentration	1	20.3	33.2	10.2	9.7	26.5
S	α1-Acid Glycoprotein	3	11.3	24.9	5.7	6.8	16.2
S	α1-Antichymotrypsin	1	13.5	18.3	6.8	5.7	16.8
S	α1-Antitrypsin	3	5.9	16.3	3.0	4.3	9.2
S	α1-Globulins	2	11.4	22.6	5.7	6.3	15.7
U	α1-Microglobulin, concentration, first morning	1	33.0	58.0	16.5	16.7	43.9

Now examining the structure:

str(biotable)

## 'data.frame':    370 obs. of  8 variables:
##  $ Sample   : chr  "S" "S" "U" "S" ...
##  $ Analyte  : chr  "11-Desoxycortisol" "17-Hydroxyprogesterone" "4-hydroxy-3-methoximandelate (VMA)" "5' Nucleotidase" ...
##  $ NumPapers: chr  "2" "2" "1" "2" ...
##  $ CVI      : chr  "21.3" "19.6" "22.2" "23.2" ...
##  $ CVG      : chr  "31.5" "50.4" "47.0" "19.9" ...
##  $ I        : chr  "10.7" "9.8" "11.1" "11.6" ...
##  $ B        : chr  "9.5" "13.5" "13.0" "7.6" ...
##  $ TE       : chr  "27.1" "29.7" "31.3" "26.8" ...

It's kind-of undesirable to have numbers as characters so…

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

library(xlsx)
write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

If you are having trouble getting xlsx to install, then just write as csv.

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Conclusion

You can now use the same general approach to parse any table you have web access to, no matter how small or big it is. Here is a complete script in one place:

library(httr)
library(rvest)
library(xlsx)

wg <- html_session("https://www.westgard.com/biodatabase1.htm", user_agent("yournamehere"))
xpaths <- paste0("/html/body/div[1]/div[3]/div/main/article/div/table[", 1:8, "]")

#define empty dataframe
biotable = data.frame(matrix(NA,0, 8))

#loop over the 8 html tables
for (j in 1:8){                
  subtable <- wg %>%
  read_html() %>%
  html_nodes(xpath =  xpaths[j] ) %>%
  html_table(., fill = TRUE) 
  subtable <- subtable[[1]]
  biotable <- rbind(biotable,subtable)
}

table.header <- c("Sample", "Analyte" ,"NumPapers", "CVI", "CVG", "I", "B","TE")
names(biotable) <- table.header

#remove extraneous rows
for.removal <- grep("Analyte", biotable$Analyte)
biotable <- biotable[-for.removal,]

#make missing data into NA
biotable[ biotable == "---" ] <- NA
row.names(biotable) <- 1:nrow(biotable)

#convert appropriate columns to numeric
biotable[,3:8] <- lapply(biotable[3:8], as.numeric)

#get rid of minus signs in column 1
biotable$Sample <- gsub("-", "", biotable$Sample, fixed = TRUE)

write.xlsx(biotable,
            file = "Westgard_Biological_Variation.xlsx",
            row.names = FALSE)

write.csv(biotable,
            file = "Westgard_Biological_Variation.csv",
            row.names = FALSE)

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)

Determine the CV of a Calculated Lab Reportable – Bioavailable Testosterone

August 7, 2017 dtholmes@mail.ubc.ca

Background

At the AACC meeting last week, some of my friends were bugging me that I had not made a blog post in 10 months. Without getting into it too much, let's just say I can blame Cerner. Thanks also to a prod from a friend, here is an approach to a fairly common problem.

We all report calculated quantities out of our laboratories: quantities such as LDL cholesterol, non-HDL cholesterol, aldosterone:renin ratio, free testosterone, eGFR etc. How does one determine the precision (i.e. imprecision) of a calculated quantity? While earlier in my life I might have gone to the trouble of doing such calculations analytically using the rules of error propagation, in my later years I am more pragmatic and am happy to use a computational approach.

In this example, we will model the precision in calculated bioavailable testosterone (CBAT). Without explanation, I provide an R function for CBAT (and free testosterone) where testosterone is reported in nmol/L, sex hormone binding globulin (SHBG) is reported in nmol/L, and albumin is reported in g/L. Using the Vermeulen Equation as discussed in this publication, you can calculate CBAT as follows:

cbat <- function(TT,SHBG,ALB = 43){
    Kalb <- 3.6*10^4
    Kshbg <- 10^9
    N <- 1 + Kalb*ALB/69000
    a <- N*Kshbg
    b <- N + Kshbg*(SHBG - TT)/10^9
    c <- -TT/10^9
    FT <- (-b + sqrt(b^2 - 4*a*c))/(2*a)*10^9
    cbat <- N*FT
    return(list(free.T = FT, cbat = cbat))
}
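
For readers who want to see where the coefficients a, b and c come from, here is a short derivation. It is reconstructed from the function body itself rather than from the cited publication, so treat it as a reading of the code and not as the authoritative statement of the Vermeulen equation. All concentrations are taken in mol/L, $K_{alb}$ and $K_{shbg}$ are the code's Kalb and Kshbg, and $[Alb]$ is ALB/69000 (the code's conversion of albumin from g/L to mol/L). The mass balance assumes one testosterone binding site per SHBG molecule.

$$N = 1 + K_{alb}[Alb]$$

$$TT = N \cdot FT + \frac{K_{shbg}\,[SHBG]\,FT}{1 + K_{shbg}\,FT}$$

Multiplying through by $1 + K_{shbg}\,FT$ and collecting powers of $FT$ gives a quadratic:

$$\underbrace{N K_{shbg}}_{a}\,FT^2 + \underbrace{\left(N + K_{shbg}\left([SHBG] - TT\right)\right)}_{b}\,FT + \underbrace{\left(-TT\right)}_{c} = 0$$

$$FT = \frac{-b + \sqrt{b^2 - 4ac}}{2a}, \qquad CBAT = N \cdot FT$$

Only the positive root is physically meaningful, which is why the code takes $-b + \sqrt{\cdot}$. The factors of $10^9$ in the function convert between nmol/L (inputs and outputs) and mol/L (the units of the association constants).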

To sanity-check this, we can use this online calculator. Taking a typical male testosterone of 20 nmol/L, an SHBG of 50 nmol/L and an albumin of 43 g/L, we get the following:

cbat(20,50)

## $free.T
## [1] 0.3273049
## 
## $cbat
## [1] 7.670319

which is confirmed by the online calculator. Because the function is vectorized, we can submit a vector of testosterone results and a vector of SHBG results and get a vector of CBAT results.

cbat(c(10,20,30), c(40,50,60))

## $free.T
## [1] 0.1738837 0.3273049 0.4661380
## 
## $cbat
## [1]  4.074926  7.670319 10.923842

Precision of Components

We now need some precision data for the three components. However, in our lab, we just substitute 43 g/L for the albumin, so we will leave that term out of the analysis and limit our precision calculation to testosterone and SHBG. This will allow us to present the precision as surface plots as a function of total testosterone and SHBG.

We do testosterone by LC-MS/MS using Deborah French's method. In the last three months, the precision has been 3.9% at 0.78 nmol/L, 5.5% at 6.7 nmol/L, 5.2% at 18.0 nmol/L, and 6.0% at 28.2 nmol/L. We are using the Roche Cobas e601 SHBG method which, according to the package insert, has a precision of 1.8% at 14.9 nmol/L, 2.1% at 45.7 nmol/L, and 4.0% at 219 nmol/L.

cv.tt <- c(3.9, 5.5, 5.2, 6.0)
conc.tt <- c(0.78, 6.7, 18.0, 28.2)
tt.df <- data.frame(conc.tt,cv.tt)

plot(cv.tt ~ conc.tt, data = tt.df,
                    main = "Precision Profile of Testosterone",
                    xlab = "Testosterone (nmol/L)",
                    ylab = "CV Testosterone (%)",
                    ylim = c(0,8),
                    type = "o")

[Figure: Precision Profile of Testosterone, CV (%) against testosterone concentration (nmol/L)]

cv.shbg <- c(1.8, 2.1, 4.0)
conc.shbg <- c(14.9,45.7,219)
shbg.df <- data.frame(cv.shbg, conc.shbg)
plot(cv.shbg ~ conc.shbg, data = shbg.df,
                    main = "Precision Profile of SHBG",
                    xlab = "SHBG (nmol/L)",
                    ylab = "CV SHBG (%)",
                    ylim = c(0,5),
                    type = "o")

[Figure: Precision Profile of SHBG, CV (%) against SHBG concentration (nmol/L)]

Build Approximation Functions

We will want to generate linear interpolations of these precision profiles. Generally, we might want to use non-linear regression to do this, but I will just linearly interpolate with the approxfun() function. This lets us call a function to get the approximate CV at concentrations other than those for which we have data.

tt.fun <- approxfun(x = tt.df$conc.tt, y = tt.df$cv.tt)
shbg.fun <- approxfun(x = shbg.df$conc.shbg, y = shbg.df$cv.shbg)

Now, if we want to know the precision of SHBG at, say, 100 nmol/L, we can just write,

shbg.fun(100)

## [1] 2.695326

to obtain our precision result.
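
One thing worth knowing about approxfun() is that, with its default settings, it returns NA for any concentration outside the range of the supplied data. The sketch below shows that behaviour and one possible workaround, `rule = 2`, which holds the CV constant at the nearest measured value. Clamping like this is my assumption for illustration only and is not part of the original analysis; whether it is reasonable depends on how the assay actually behaves beyond the tested levels. The name `shbg.fun.clamped` is hypothetical.

```r
# SHBG precision data copied from the post
conc.shbg <- c(14.9, 45.7, 219)
cv.shbg <- c(1.8, 2.1, 4.0)

# Default behaviour (rule = 1): NA outside the measured concentration range
shbg.fun <- approxfun(x = conc.shbg, y = cv.shbg)
shbg.fun(c(10, 100, 250))

# Assumption for illustration: rule = 2 returns the CV of the nearest
# measured level for concentrations beyond either end of the range
shbg.fun.clamped <- approxfun(x = conc.shbg, y = cv.shbg, rule = 2)
shbg.fun.clamped(c(10, 100, 250))
```

Inside the measured range the two functions agree, so the choice only matters if the simulation grid is extended past 14.9–219 nmol/L.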

Random Simulation

Now let's build a grid of SHBG and total testosterone (TT) values at which we will calculate the precision for CBAT.

shbg <- seq(from = 15, to = 200, by = 5)
tt <- seq(from = 1, to = 28, by = 1)

At each point on the grid, we will have to generate, say, 100000 random TT values and 100000 random SHBG values with the appropriate precision and then calculate the expected precision of CBAT at those concentrations.

Let's do this for a single pair of concentrations by way of example, modelling the random analytical error as Gaussian using the rnorm() function.

# [SHBG] = 15 nmol/L
# [TT] = 5.0 nmol/L
set.seed(100) #just to get consistent results
rng.tt <- rnorm(100000, mean = 5.0, sd = tt.fun(5.0)/100*5.0)
rng.shbg <- rnorm(100000, mean = 15, sd = shbg.fun(15)/100*15)
rng.cbat <- cbat(rng.tt, rng.shbg)
cv.cbat <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
cv.cbat

## [1] 5.30598
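
As a sanity check on the simulation, the same point can be approximated with first-order error propagation, where the variance of the result is the sum of each squared partial derivative multiplied by the variance of the corresponding input. The sketch below estimates the partial derivatives numerically by central finite differences instead of deriving them by hand. It assumes the errors in TT and SHBG are independent, which the simulation also assumes, and the step size `h` and the names `d.tt`, `d.shbg` and `cv.delta` are my own choices for illustration.

```r
# Functions and precision data repeated from the post so the block runs alone
cbat <- function(TT, SHBG, ALB = 43) {
  Kalb <- 3.6 * 10^4
  Kshbg <- 10^9
  N <- 1 + Kalb * ALB / 69000
  a <- N * Kshbg
  b <- N + Kshbg * (SHBG - TT) / 10^9
  c <- -TT / 10^9
  FT <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a) * 10^9
  list(free.T = FT, cbat = N * FT)
}
tt.fun <- approxfun(x = c(0.78, 6.7, 18.0, 28.2), y = c(3.9, 5.5, 5.2, 6.0))
shbg.fun <- approxfun(x = c(14.9, 45.7, 219), y = c(1.8, 2.1, 4.0))

TT <- 5.0
SHBG <- 15

# Standard deviations implied by the interpolated CVs
sd.tt <- tt.fun(TT) / 100 * TT
sd.shbg <- shbg.fun(SHBG) / 100 * SHBG

# Central finite-difference estimates of the partial derivatives of CBAT
h <- 1e-4
d.tt <- (cbat(TT + h, SHBG)$cbat - cbat(TT - h, SHBG)$cbat) / (2 * h)
d.shbg <- (cbat(TT, SHBG + h)$cbat - cbat(TT, SHBG - h)$cbat) / (2 * h)

# First-order propagation assuming independent errors
sd.cbat <- sqrt((d.tt * sd.tt)^2 + (d.shbg * sd.shbg)^2)
cv.delta <- sd.cbat / cbat(TT, SHBG)$cbat * 100
cv.delta
```

The propagated CV should land close to the simulated value of about 5.3%. Agreement is expected here because the input CVs are small enough for CBAT to be nearly linear over the spread of the inputs; the simulation remains the safer tool when that is not the case.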

So, we can build the process of calculating the CV of CBAT into a function as follows:

cbat.cv <- function(TT, SHBG, N = 100000){
  rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT)/100*TT)
  rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG)/100*SHBG)
  rng.cbat <- cbat(rng.tt, rng.shbg)
  cv <- sd(rng.cbat$cbat)/mean(rng.cbat$cbat)*100
  return(cv)
}
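
Because cbat.cv() is itself a random estimate, it is worth a quick look at how much it wobbles from run to run before trusting the shape of a surface built from it. The sketch below repeats the estimate 20 times at two values of N for the same pair of concentrations used above. The repeat count, the smaller N and the names `reps.small` and `reps.large` are arbitrary choices of mine.

```r
# Functions and precision data repeated from the post so the block runs alone
cbat <- function(TT, SHBG, ALB = 43) {
  Kalb <- 3.6 * 10^4
  Kshbg <- 10^9
  N <- 1 + Kalb * ALB / 69000
  a <- N * Kshbg
  b <- N + Kshbg * (SHBG - TT) / 10^9
  c <- -TT / 10^9
  FT <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a) * 10^9
  list(free.T = FT, cbat = N * FT)
}
tt.fun <- approxfun(x = c(0.78, 6.7, 18.0, 28.2), y = c(3.9, 5.5, 5.2, 6.0))
shbg.fun <- approxfun(x = c(14.9, 45.7, 219), y = c(1.8, 2.1, 4.0))

cbat.cv <- function(TT, SHBG, N = 100000) {
  rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT) / 100 * TT)
  rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG) / 100 * SHBG)
  rng.cbat <- cbat(rng.tt, rng.shbg)
  sd(rng.cbat$cbat) / mean(rng.cbat$cbat) * 100
}

set.seed(100)
# Repeat the estimate 20 times at a small and a large number of draws
reps.small <- replicate(20, cbat.cv(5, 15, N = 1000))
reps.large <- replicate(20, cbat.cv(5, 15, N = 100000))

# Run-to-run spread of the estimated CV (in CV percentage points)
c(sd.small = sd(reps.small), sd.large = sd(reps.large))
```

The spread should shrink roughly in proportion to the square root of N, so 100000 draws per grid point gives a much smoother surface than 1000 at the cost of a longer run time.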

Now we can use expand.grid() to build a dataframe containing every combination of TT and SHBG on the grid, calculate the CV of CBAT at each point, and append it to the dataframe as a new column.

cv.grid <- expand.grid(tt, shbg)
names(cv.grid) <- c("tt", "shbg")
cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg)

Now make a surface plot using the wireframe() function from the lattice package.

library(lattice)
wireframe(cv.cbat ~ tt*shbg, data = cv.grid,
          xlab = "Testo \n (nmol/L)",
          ylab = "SHBG \n (nmol/L)",
          zlab = "CV \n (%)",
          drape = TRUE,
          colorkey = TRUE,
          col.regions = colorRampPalette(c("blue", "red", "yellow"))(100),
          scales = list(arrows=FALSE,cex=.5,tick.number = 10)
          )

[Figure: wireframe surface plot of CV of CBAT (%) against testosterone (nmol/L) and SHBG (nmol/L)]

This shows us that the CV of CBAT ranges from about 4–8% over the TT and SHBG ranges we have looked at.
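
The range read off the surface can also be checked numerically, and the same grid can be viewed as a flat contour map, which some readers find easier to read values from than a 3D wireframe. The sketch below rebuilds the grid with a smaller N of 5000 per point purely to keep the run time short; that value, the use of base contour() instead of lattice, and the name `cv.mat` are my choices, not the author's.

```r
# Functions and precision data repeated from the post so the block runs alone
cbat <- function(TT, SHBG, ALB = 43) {
  Kalb <- 3.6 * 10^4
  Kshbg <- 10^9
  N <- 1 + Kalb * ALB / 69000
  a <- N * Kshbg
  b <- N + Kshbg * (SHBG - TT) / 10^9
  c <- -TT / 10^9
  FT <- (-b + sqrt(b^2 - 4 * a * c)) / (2 * a) * 10^9
  list(free.T = FT, cbat = N * FT)
}
tt.fun <- approxfun(x = c(0.78, 6.7, 18.0, 28.2), y = c(3.9, 5.5, 5.2, 6.0))
shbg.fun <- approxfun(x = c(14.9, 45.7, 219), y = c(1.8, 2.1, 4.0))
cbat.cv <- function(TT, SHBG, N = 100000) {
  rng.tt <- rnorm(N, mean = TT, sd = tt.fun(TT) / 100 * TT)
  rng.shbg <- rnorm(N, mean = SHBG, sd = shbg.fun(SHBG) / 100 * SHBG)
  rng.cbat <- cbat(rng.tt, rng.shbg)
  sd(rng.cbat$cbat) / mean(rng.cbat$cbat) * 100
}

shbg <- seq(from = 15, to = 200, by = 5)
tt <- seq(from = 1, to = 28, by = 1)

set.seed(100)
cv.grid <- expand.grid(tt = tt, shbg = shbg)
# Smaller N than in the post, chosen only to keep this example quick
cv.grid$cv.cbat <- mapply(cbat.cv, cv.grid$tt, cv.grid$shbg,
                          MoreArgs = list(N = 5000))

# Numerical summary: overall range and the grid point with the highest CV
range(cv.grid$cv.cbat)
cv.grid[which.max(cv.grid$cv.cbat), ]

# expand.grid() varies its first argument fastest, so filling a matrix
# column-wise puts tt along the rows and shbg along the columns
cv.mat <- matrix(cv.grid$cv.cbat, nrow = length(tt), ncol = length(shbg))
contour(x = tt, y = shbg, z = cv.mat,
        xlab = "Testosterone (nmol/L)",
        ylab = "SHBG (nmol/L)",
        main = "CV of CBAT (%)")
```

With only 5000 draws per point the contour lines will look a little ragged; raising N back towards 100000 smooths them out.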

Conclusion

We have estimated the CV of calculated bioavailable testosterone by random number simulation driven by empirical CV data for testosterone and SHBG, and produced a surface plot of the result. This allows us to comment on the CV of this lab reportable as a function of the two measured variables by which it is determined.

Parting Thought on Monte Carlo Simulations

The die is cast into the lap, but its every decision is from the LORD.

(Prov 16:33)
