C++ – The Lab-R-torian

Background

R dogma is that for loops are bad because they are slow but this is not the case in C++. I had never programmed a line of C++ as of last week but my beloved firstborn started university last week and is enrolled in a C++ intro course, so I thought I would try to learn some and see if it would speed up Passing Bablok regression.

Passing Bablok Regression

As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?

Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.

PB.reg <- function(x,y){
  lx <- length(x)
  l <- lx*(lx - 1)/2
  k <- 0
  S <- rep(NA, lx)
  for (i in 1:(lx - 1)) {
    for (j in (i + 1):lx) {
            k <- k + 1
            S[k] <- (y[i] - y[j])/(x[i] - x[j])
    }
  }
  S.sort <- sort(S)
  N <- length(S.sort)
  neg <- length(subset(S.sort,S.sort < 0))
  K <- floor(neg/2)
  if (N %% 2 == 1) {
      b <- S.sort[(N+1)/2+K]
  } else {
      b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])
  }
  a <- median(y - b * x)
  return(as.vector(c(a,b)))
}

PB.reg <- function(x,y){

lx <- length(x)

l <- lx*(lx - 1)/2

k <- 0

S <- rep(NA, lx)

for (i in 1:(lx - 1)) {

for (j in (i + 1):lx) {

k <- k + 1

S[k] <- (y[i] - y[j])/(x[i] - x[j])

}

S.sort <- sort(S)

N <- length(S.sort)

neg <- length(subset(S.sort,S.sort < 0))

K <- floor(neg/2)

if (N %% 2 == 1) {

b <- S.sort[(N+1)/2+K]

} else {

b <- sqrt(S.sort[N / 2 + K]*S.sort[N / 2 + K + 1])

}

a <- median(y - b * x)

return(as.vector(c(a,b)))

}

So let’s make some fake data and see what we get:

set.seed(314)
x <- seq(1, 100, length.out = 100)
y <- 1.1* x + 20 + x*rnorm(100,0,0.10)
reg <- PB.reg(x,y)
round(reg,2)

set.seed(314)

x <- seq(1, 100, length.out = 100)

y <- 1.1* x + 20 + x*rnorm(100,0,0.10)

reg <- PB.reg(x,y)

round(reg,2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Just to sanity check, we can get the coefficients from the mcr() package.

library(mcr)
reg.mcr <- mcreg(x,y)
round(reg.mcr@glob.coef,2)

library(mcr)

reg.mcr <- mcreg(x,y)

round(reg.mcr@glob.coef,2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Ok, looks good.

Rcpp

The Rcpp package permits compiling and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at recapitulation of the R script above.

#include <iostream>
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>
#include <numeric>
#include "Rcpp.h"
using namespace std;
using namespace Rcpp;
// [[Rcpp::plugins("cpp11")]]
// [[Rcpp::export]]

vector <double> PB(vector<double> x,vector<double> y){
  int lx=x.size();
  int l=lx*(lx-1)/2;
  int k=-1;
  vector <double> S(l,numeric_limits<double>::quiet_NaN());
  vector <double> sortS;
  vector <double> neg;
  double b;
  double a;
  vector <double> aVec(x.size());
  vector <double> coef(2);

  // Calculate the components of the S Matrix
  for(int i=0; i<lx-1;i++){
    for(int j=i+1; j<lx;j++){
        k+=1;
        S[k]=(y[i]-y[j])/(x[i]-x[j]);
    }
  }

  // Sort the S Matrix
  sort(S.begin(),S.end());

  // Delete any undefined values from S
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if (!isnan(*it)) sortS.push_back((*it));
  }

  // Put all the negative values into a vector called neg
  for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){
      if ((*it)<0) neg.push_back((*it));
  }

  // Calculate N and K
  int K=int(floor(float(neg.size()/2)));
  int N=sortS.size();

  // Calculate the slope
  if (N%2==1) {
    b=S[static_cast<int> ((N+1)/2+K-1)];
  } else {
    b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);
  }

  // Make a vector aVec of the estimates of the intercept
  for(int i=0; i<x.size(); i++){
      aVec[i]=y[i]-b*x[i];
  }
  sort(aVec.begin(),aVec.end());

  // Calculate the median from the sorted values of aVec
  if(x.size()%2==1){
    a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];
  }else{
    a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;
  }

  // Report results
  coef[0]=a;
  coef[1]=b;
  return coef;
}

int main(){
  return 0;
}

#include <iostream>

#include <vector>

#include <cmath>

#include <limits>

#include <algorithm>

#include <numeric>

#include "Rcpp.h"

using namespace std;

using namespace Rcpp;

// [[Rcpp::plugins("cpp11")]]

// [[Rcpp::export]]

vector <double> PB(vector<double> x,vector<double> y){

int lx=x.size();

int l=lx*(lx-1)/2;

int k=-1;

vector <double> S(l,numeric_limits<double>::quiet_NaN());

vector <double> sortS;

vector <double> neg;

double b;

double a;

vector <double> aVec(x.size());

vector <double> coef(2);

// Calculate the components of the S Matrix

for(int i=0; i<lx-1;i++){

for(int j=i+1; j<lx;j++){

k+=1;

S[k]=(y[i]-y[j])/(x[i]-x[j]);

}

// Sort the S Matrix

sort(S.begin(),S.end());

// Delete any undefined values from S

for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){

if (!isnan(*it)) sortS.push_back((*it));

}

// Put all the negative values into a vector called neg

for(vector <double> :: iterator it = S.begin(); it != S.end(); ++it){

if ((*it)<0) neg.push_back((*it));

}

// Calculate N and K

int K=int(floor(float(neg.size()/2)));

int N=sortS.size();

// Calculate the slope

if (N%2==1) {

b=S[static_cast<int> ((N+1)/2+K-1)];

} else {

b=sqrt(S[static_cast<int> (N/2+K-1)]*S[static_cast<int> (N/2+K)]);

}

// Make a vector aVec of the estimates of the intercept

for(int i=0; i<x.size(); i++){

aVec[i]=y[i]-b*x[i];

}

sort(aVec.begin(),aVec.end());

// Calculate the median from the sorted values of aVec

if(x.size()%2==1){

a=aVec[static_cast<int>((x.size()+1.0)/2.0-1.0)];

}else{

a=(aVec[static_cast<int> (x.size()/2.0-1.0)] + aVec[static_cast<int> (x.size()/2.0)])/2.0;

}

// Report results

coef[0]=a;

coef[1]=b;

return coef;

}

int main(){

return 0;

}

Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:

library(Rcpp)
sourceCpp("PB_Reg.cpp")

1 2	library(Rcpp) sourceCpp("PB_Reg.cpp")

And run it:

round(PB(x,y),2)

1	round(PB(x,y),2)

## [1] 18.90  1.14

1	## [1] 18.90 1.14

Looks good! But is it faster? Like really faster? Let’s find out. And let’s also compare it to what the compiler() package produces:

library(compiler)
PB.reg.cmpfun <- cmpfun(PB.reg)

library(rbenchmark)
bmark <- benchmark(
  Rscript = PB.reg(x,y),
  `Compiler Package` = PB.reg.cmpfun(x,y),
  Rcpp = PB(x,y),
  replications = 100,
  order = "relative"
)
knitr::kable(bmark)

library(compiler)

PB.reg.cmpfun <- cmpfun(PB.reg)

library(rbenchmark)

bmark <- benchmark(

Rscript = PB.reg(x,y),

`Compiler Package` = PB.reg.cmpfun(x,y),

Rcpp = PB(x,y),

replications = 100,

order = "relative"

)

knitr::kable(bmark)

	test	replications	elapsed	relative	user.self	sys.self
3	Rcpp	100	0.024	1.000	0.024	0.000
2	Compiler Package	100	0.214	8.917	0.214	0.000
1	Rscript	100	0.217	9.042	0.213	0.004

So this makes C++ code about 9x faster than R source code. Not bad but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler() package did not help much in this case as it is only marginally better than raw R source. I have seen better results from compiler() in other cases.

Final thought

Speaking of speed:

“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”

Proverbs 6:5

The Lab-R-torian

Tag: C++

Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

Background

Passing Bablok Regression

Rcpp

Final thought