Is the pain worth it?: Can Rcpp speed up Passing Bablok Regression?

Background

R dogma is that for loops are bad because they are slow, but this is not the case in C++. I had never programmed a line of C++ until last week, but my beloved firstborn just started university and is enrolled in an intro C++ course, so I thought I would try to learn some and see if it could speed up Passing Bablok regression.

Passing Bablok Regression

As mentioned in the past, the field of Clinical Chemistry has a peculiar devotion to Passing Bablok regression… and hey, why not?

Here is the code for a minimal implementation of Passing Bablok regression as discussed in this paper. This is the scale-invariant version.
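A minimal sketch of such an implementation follows (all pairwise slopes, the shifted median, and a median intercept; the function and variable names are illustrative, not the original listing):

    # Minimal sketch of scale-invariant Passing Bablok regression
    PB.reg <- function(x, y) {
      pairs <- combn(length(x), 2)                      # all pairs of points
      S <- (y[pairs[2, ]] - y[pairs[1, ]]) /
           (x[pairs[2, ]] - x[pairs[1, ]])              # pairwise slopes
      S <- sort(S[is.finite(S) & S != -1])              # drop undefined slopes and -1
      N <- length(S)
      K <- sum(S < -1)                                  # offset for the shifted median
      b <- if (N %% 2 == 1) S[(N + 1) / 2 + K] else mean(S[N / 2 + K + 0:1])
      a <- median(y - b * x)                            # intercept from the median residual
      c(intercept = a, slope = b)
    }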

So let’s make some fake data and see what we get:
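Something like the following serves as fake data (the constants here are stand-ins, not the original values):

    set.seed(100)                                 # arbitrary seed for reproducibility
    x <- runif(50, 1, 10)
    y <- 1.2 * x + 0.5 + rnorm(50, sd = 0.3)      # proportional bias, constant bias, noise
    PB.reg(x, y)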

Just to sanity check, we can get the coefficients from the mcr package.
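For example:

    library(mcr)
    mcreg(x, y, method.reg = "PaBa")@para    # Passing Bablok via the mcr package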

Ok, looks good.

Rcpp

The Rcpp package permits compilation and execution of C++ code from within R. This can be good for computationally intensive tasks. Here is my child-like attempt at recapitulating the R script above.
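A sketch of what such an Rcpp translation might look like (a reconstruction in the same spirit, not the original PB_Reg.cpp):

    #include <Rcpp.h>
    #include <algorithm>
    #include <vector>
    using namespace Rcpp;

    // Sketch of scale-invariant Passing Bablok regression in Rcpp
    // [[Rcpp::export]]
    NumericVector PB_reg_cpp(NumericVector x, NumericVector y) {
      int n = x.size();
      std::vector<double> S;
      for (int i = 0; i < n - 1; ++i) {
        for (int j = i + 1; j < n; ++j) {
          double dx = x[j] - x[i];
          if (dx == 0) continue;                 // skip undefined slopes
          double s = (y[j] - y[i]) / dx;
          if (s != -1) S.push_back(s);           // the scale-invariant form drops -1
        }
      }
      std::sort(S.begin(), S.end());
      int N = S.size();
      int K = 0;
      for (int k = 0; k < N; ++k) if (S[k] < -1) ++K;   // offset for the shifted median
      double b = (N % 2 == 1) ? S[(N - 1) / 2 + K]
                              : 0.5 * (S[N / 2 - 1 + K] + S[N / 2 + K]);
      NumericVector r(n);
      for (int i = 0; i < n; ++i) r[i] = y[i] - b * x[i];
      double a = median(r);                      // intercept from the median residual
      return NumericVector::create(_["intercept"] = a, _["slope"] = b);
    }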

Saving this to a local directory as PB_Reg.cpp, we can compile it and call the function into the R session as follows:
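With Rcpp loaded, this is a one-liner:

    library(Rcpp)
    sourceCpp("PB_Reg.cpp")    # compiles the C++ and exports its functions into the session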

And run it:
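Assuming the exported function is named as in the sketch above:

    PB_reg_cpp(x, y)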

Looks good! But is it faster? Like really faster? Let's find out. And let's also compare it to what the compiler package produces:
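A sketch of how such a comparison might be run, assuming the timings come from the rbenchmark package and the byte-compiled version from cmpfun():

    library(compiler)
    library(rbenchmark)
    PB.reg.c <- cmpfun(PB.reg)                     # byte-compiled R version
    benchmark("Rscript"          = PB.reg(x, y),
              "Compiler Package" = PB.reg.c(x, y),
              "Rcpp"             = PB_reg_cpp(x, y),
              replications = 100)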

                  test replications elapsed relative user.self sys.self user.child sys.child
    3             Rcpp          100   0.024    1.000     0.024    0.000          0         0
    2 Compiler Package          100   0.214    8.917     0.214    0.000          0         0
    1          Rscript          100   0.217    9.042     0.213    0.004          0         0

So this makes the C++ code about 9x faster than the R source code. Not bad, but in this case it was probably not worth the effort. Oh well. I learned something. Incidentally, the compiler package did not help much in this case, as it is only marginally better than raw R source. I have seen better results from compiler in other cases.

Final thought

Speaking of speed:

“Free yourself, like a gazelle from the hand of the hunter, like a bird from the snare of the fowler.”

Proverbs 6:5


Non-Linear Regression: Application to Monoclonal Peak Integration in Serum Protein Electrophoresis

Background

At the AACC meeting recently, there was an enthusiastic discussion of standardization of reporting for serum protein electrophoresis (SPEP) presented by a working group headed up by Dr. Chris McCudden and Dr. Ron Booth, both of the University of Ottawa. One of the discussions pertained to how monoclonal bands, especially small ones, should be integrated. While many use the default manual vertical gating or “drop” method offered by Sebia's Phoresis software, Dr. David Keren was discussing the value of tangent skimming as a more repeatable and effective means of monoclonal protein quantitation. He was also discussing some biochemical approaches distinguishing monoclonal proteins from the background gamma proteins.

The drop method is essentially an eyeball approach to deciding where the peak starts and ends; it is represented by the vertical lines and the enclosed shaded area.

[Figure: drop-method integration of the monoclonal peak]

The tangent skimming approach is easier to make reproducible. In the mass spectrometry world it is a well-developed approach with a long history and multiple algorithms in use. This is apparently the book. However, when tangent skimming is employed in SPEP, unless I am mistaken, it seems to be done by eye. The integration would look like this:

[Figure: tangent-skimmed integration of the monoclonal peak]

During the discussion it was pointed out that peak deconvolution of the monoclonal protein from the background gamma might be preferable to either of the two procedures described. By this I mean integration as follows:

[Figure: deconvolution of the monoclonal peak from the gamma background]

There was discussion that this procedure is challenging for a number of reasons. Further, it should be noted that a deconvolution approach is only likely to add clinical value when the concentration of the monoclonal protein is low enough that manual integration shows poor repeatability, say < 5 g/L = 0.5 g/dL.

Easy Peaks

Fitting samples with larger monoclonal peaks is fairly easy: the fit tends to converge nicely and produce something meaningful. For example, using the approach I am about to show below, an electropherogram like this:

[Figure: example electropherogram with a large monoclonal peak]

with a gamma region looking like this:

[Figure: the gamma region of the same scan]

can be deconvoluted with straightforward non-linear regression (and no baseline subtraction) to yield this:

[Figure: two-Gaussian deconvolution of the gamma region]

and the area of the green monoclonal peak is found to be 5.3%.

More Difficult Peaks

What is more challenging is the problem of small monoclonals buried in normal \(\gamma\)-globulins. These could be difficult to integrate using a tangent skimming approach, particularly without image magnification. For the remainder of this post we will use a gel with a small monoclonal in the fast gamma region shown at the arrow.

[Figure: gel with a small monoclonal in the fast gamma region, marked by the arrow]

Getting the Data

EP data can be extracted from the PDF output from any electrophoresis software. This is not complicated and can be accomplished with pdf2svg or Inkscape and some Linux bash scripting. I'm sure we can get it straight from the instrument but it is not obvious to me how to do this. One could also rescan a gel and use ImageJ to produce a densitometry scan which is discussed in the ImageJ documentation and on YouTube. ImageJ also has a macro language for situations where the same kind of processing is done repeatedly.

Smoothing

The data has 10284 pairs of (x,y) values. But if you zoom in and look carefully, you find that it is a series of staircases.

[Figure: staircase appearance of the raw data on magnification]

It turns out that this jaggedness significantly impairs attempts to numerically identify the peaks and valleys. So, I smoothed it a little using the handy rle() function to identify the midpoint of each step. This keeps the total area as close to its original value as possible, though this probably does not matter too much.
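A sketch of this step-midpoint smoothing, assuming the scan is stored in a dataframe ep.data with columns x and y (hypothetical names):

    # Keep the midpoint of each run of identical y values
    runs   <- rle(ep.data$y)
    ends   <- cumsum(runs$lengths)                # last index of each step
    starts <- ends - runs$lengths + 1             # first index of each step
    mid    <- floor((starts + ends) / 2)          # midpoint index of each step
    ep.smooth <- data.frame(x = ep.data$x[mid], y = runs$values)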

[Figure: the smoothed scan]

Now that we are satisfied that the new data is OK, I will overwrite the original dataframe.

Transformation

The units on the x and y-axes are arbitrary and come from page coordinates of the PDF. We can normalize the scan by making the x-axis go from 0 to 1 and by making the total area 1.
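A sketch of the normalization, again assuming the dataframe ep.data:

    # Scale x to [0, 1] and rescale y so the trapezoidal area is 1
    ep.data$x <- (ep.data$x - min(ep.data$x)) / (max(ep.data$x) - min(ep.data$x))
    area <- sum(diff(ep.data$x) * (head(ep.data$y, -1) + tail(ep.data$y, -1)) / 2)
    ep.data$y <- ep.data$y / area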

[Figure: the normalized electropherogram]

Find Extrema

Using the findPeaks function from the quantmod package we can find the minima and maxima:
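For example:

    library(quantmod)
    maxima <- findPeaks(ep.data$y)      # indices just past each local maximum
    minima <- findValleys(ep.data$y)    # indices just past each local minimum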

[Figures: local maxima and minima identified on the scan]

Not surprisingly, there are some extraneous local extrema that we do not want. I simply manually removed them. Generally, this kind of thing could be tackled with more smoothing of the data prior to analysis.

Fitting

Now it's possible with the nls() function to fit the entire SPEP with a series of Gaussian curves simultaneously. It works just fine (provided you have decent initial estimates of \(\mu_i\) and \(\sigma_i\)) but there is no particular clinical value to fitting the albumin, \(\alpha_1\), \(\alpha_2\), \(\beta_1\) and \(\beta_2\) domains with Gaussians. What is of interest is separately quantifying the two peaks in \(\gamma\) with two separate Gaussians so let's isolate the \(\gamma\) region based on the location of the minimum between \(\beta_2\) and \(\gamma\).

Isolate the \(\gamma\) Region
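A sketch, assuming beta2.gamma.min (a hypothetical name) holds the index of the minimum between \(\beta_2\) and \(\gamma\) found above:

    gamma.data <- ep.data[beta2.gamma.min:nrow(ep.data), ]   # everything from that minimum onward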

[Figure: the isolated gamma region]

Attempt Something that Ultimately Does Not Work

At first I thought I could just throw two normal distributions at this and it would work. However, it does not work well at all and this kind of not-so-helpful fit turns out to happen a fair bit. I use the nls() function here which is easy to call. It requires a functional form which I set to be:

\[y = C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big) + C_2 \exp \Big({-\frac{(x-\mu_2)^2}{2\sigma_2^2}}\Big)\]

where \(\mu_1\) is the \(x\) location of the first peak in \(\gamma\) and \(\mu_2\) is the \(x\) location of the second peak in \(\gamma\). The estimates of \(\sigma_1\) and \(\sigma_2\) can be obtained by trying to estimate the full-width-half-maximum (FWHM) of the peaks, which is related to \(\sigma\) by

\[FWHM_i = 2 \sqrt{2\ln2} \times \sigma_i = 2.355 \times \sigma_i\]

I had to first make a little function that returns the respective half-widths at half-maximum and then uses them to estimate the \(FWHM\). Because the peaks are poorly resolved, it also tries to get the smallest possible estimate, returning this as FWHM2.
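A sketch of such a function, assuming a dataframe dat with columns x and y and the peak's row index peak.index (illustrative names):

    # Half-widths at half-maximum on each side of the peak
    fwhm <- function(dat, peak.index) {
      half  <- dat$y[peak.index] / 2
      left  <- max(which(dat$y[1:peak.index] <= half))            # last point below half-max on the left
      right <- peak.index - 1 +
               min(which(dat$y[peak.index:nrow(dat)] <= half))    # first point below half-max on the right
      hwhm.l <- dat$x[peak.index] - dat$x[left]
      hwhm.r <- dat$x[right] - dat$x[peak.index]
      c(FWHM  = hwhm.l + hwhm.r,
        FWHM2 = 2 * min(hwhm.l, hwhm.r))    # smallest possible estimate, doubled
    }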

The peak in the \(\gamma\) region was obtained previously:

[Figure: the first peak identified in the gamma region]

and from this, \(\mu_1\) is determined to be 0.7. We have to guess where the second peak is, which is at about \(x=0.75\), corresponding to index 252 of the gamma.data dataframe.

[Figure: guessed location of the second gamma peak]

Now we can find the estimates of the standard deviations:

With \(\sigma_1\) and \(\sigma_2\) in hand, the estimates of \(C_1\) and \(C_2\) are just the respective peak heights.
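Putting the starting estimates together (peak1.index is a stand-in for the index of the first \(\gamma\) peak found above; 252 is the index of the second, as noted):

    mu1 <- 0.70
    mu2 <- 0.75
    sigma1 <- unname(fwhm(gamma.data, peak1.index)["FWHM2"]) / 2.355
    sigma2 <- unname(fwhm(gamma.data, 252)["FWHM2"]) / 2.355
    C1 <- gamma.data$y[peak1.index]    # amplitudes from the peak heights
    C2 <- gamma.data$y[252]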

We can now use nls() to determine the fit.
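A sketch of the call, using the functional form above and the starting estimates we just prepared:

    fit <- nls(y ~ C1 * exp(-(x - mu1)^2 / (2 * sigma1^2)) +
                   C2 * exp(-(x - mu2)^2 / (2 * sigma2^2)),
               data  = gamma.data,
               start = list(C1 = C1, mu1 = mu1, sigma1 = sigma1,
                            C2 = C2, mu2 = mu2, sigma2 = sigma2))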

Determining the fitted values of our unknown coefficients:
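For example:

    coef(fit)    # fitted values of C1, mu1, sigma1, C2, mu2, sigma2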

And now we can plot the fitted results against the original results:

[Figure: the failed two-Gaussian fit of the gamma region]

And this is garbage. The green curve is supposed to be the monoclonal peak, the blue curve is supposed to be the \(\gamma\) background, and the red curve is their sum, the overall fit. This is a horrible failure.

Subsequently, I tried fixing the locations of \(\mu_1\) and \(\mu_2\), but this yielded similarly nonsensical fits. After a lot of messing around, trying different functions (the lognormal, Bi-Gaussian and exponentially modified Gaussian distributions), applying various arbitrary weighting functions, and simultaneously fitting the other regions of the SPEP, I concluded that nothing could predictably produce results that represented the clinical reality.

I thought maybe the challenge in obtaining a reasonable fit related to the sloping baseline, so I thought I would try to remove it. I will model the baseline in the most simplistic manner possible: as a sloped line.

Baseline Removal

I will arbitrarily define the tail of the \(\gamma\) region to be those values having \(y \leq 0.02\). Then I will connect the first (x,y) point of the \(\gamma\) region to the start of the tail with a straight line.
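A sketch of that baseline, assuming gamma.data from above:

    tail.start <- min(which(gamma.data$y <= 0.02))       # first point of the low tail
    slope <- (gamma.data$y[tail.start] - gamma.data$y[1]) /
             (gamma.data$x[tail.start] - gamma.data$x[1])
    baseline <- gamma.data$y[1] + slope * (gamma.data$x - gamma.data$x[1])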

[Figure: linear baseline drawn under the gamma region]

Now we can define a new dataframe gamma.no.base that has the baseline removed:
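For example:

    gamma.no.base <- data.frame(x = gamma.data$x,
                                y = pmax(gamma.data$y - baseline, 0))   # clip any negatives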

[Figure: the gamma region before and after baseline removal]

The black is the original \(\gamma\) and the dashed has the baseline removed. This becomes an easy fit.

[Figure: two-Gaussian fit after baseline removal]

Lo and behold…something that is not completely insane. The green is the monoclonal, the blue is the \(\gamma\) background and the red is their sum, that is, the overall fit. A better fit could now be sought with weighting or with a more flexible distribution shape. In any case, the green peak is now easily determined. Since

\[\int_{-\infty}^{\infty} C_1 \exp\Big(-{\frac{(x-\mu_1)^2}{2\sigma_1^2}}\Big)dx = \sqrt{2\pi}\sigma_1 C_1\]

the area of the monoclonal peak follows directly from the fitted coefficients, and this peak works out to 2.4% of the total area. Now, of course, this assumes that nothing under the baseline is attributable to the monoclonal peak and that it all belongs to normal \(\gamma\)-globulins, which is very unlikely to be true. However, the drop and tangent skimming methods also make assumptions about how the area under the curve contributes to the monoclonal protein. The point is to try to do something that will produce consistent results that can be followed over time. Obviously, if you thought there were three peaks in the \(\gamma\)-region, you'd have to set up your model accordingly.
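As a sketch, the area calculation from the fitted coefficients might look like this (assuming the post-baseline-removal fit is stored in fit and the monoclonal is the second Gaussian):

    cf <- coef(fit)
    sqrt(2 * pi) * cf["sigma2"] * cf["C2"]    # area of the monoclonal peak; total area is 1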

All about that Base(line)

There are obviously better ways to model the baseline, because a linear baseline is not going to work in situations where, for example, there is a small monoclonal in fast \(\gamma\) dwarfed by normal \(\gamma\)-globulins. That is, like this:

[Figure: a small fast-gamma monoclonal dwarfed by normal gamma-globulins]

Something curvilinear or piecewise continuous and flexible enough for more circumstances is generally required.

There is also no guarantee that baseline removal, whatever the approach, is going to be a good solution in other circumstances. Given the diversity of monoclonal peak locations, sizes and shapes, I suspect one would need a few different approaches for different circumstances.

Conclusions

  • The data in the PDFs generated by EP software are processed (probably with splining or similar), followed by the stair-stepping seen above. It would be better to work with raw data from the scanner.

  • Integrating monoclonal peaks under the \(\gamma\) baseline (or \(\beta\)) is unlikely to be a one-size-fits-all problem and may require the application of a number of strategies to get meaningful results.

    • Baseline removal might be helpful at times.

  • Peak integration will require human adjudication.

  • While most monoclonal peaks show little skewing, better fitting is likely to be obtained with distributions that afford some skewing.

  • MASSFIX may soon make this entire discussion irrelevant.

Parting Thought

On the matter of fitting

In bringing many sons and daughters to glory, it was fitting that God, for whom and through whom everything exists, should make the pioneer of their salvation perfect through what he suffered.

Heb 2:10

A Shiny App for Passing Bablok and Deming Regression

Background

Back in 2011 I was not aware of any tool in R for Passing Bablok (PB) regression, a form of robust regression described in a series of three papers in Clinical Chemistry and Laboratory Medicine (then J Clin Chem and Biochem) available here, here and here. For reasons that are not entirely clear to me, this regression methodology is favoured by clinical chemists but seems largely ignored by other disciplines. However, since reviewers for clinical chemistry journals will demand the use of PB regression, it seemed expeditious to me to code it in R. This spawned a small project: a piece of software to do PB (and Deming and ordinary least squares) regression using a self-contained executable that could be downloaded, unzipped on a Windows desktop and just run. You can download it here, and instructions for installation and use are here and here respectively. The calculations are all done in R, the GUI is built with Python and PyQt4, and the executable with cx_Freeze. I made it run without an installer because hospital IT often refuse to install software that has not been officially vetted and purchased.

The tool was a lot more popular than I anticipated, now having about 2000 downloads. In any case, maintenance, upgrades, bug fixing and dealing with operating system updates that break things (like OSX El Capitan's security policies) are no fun, so a Shiny based solution to the same problem makes a lot of sense.

Update

Since 2011, statisticians at Roche Diagnostics have programmed the mcr package for PB and Deming regression. There are also the MethComp package and the deming package from the Mayo Clinic, both of which offer PB regression.

Shiny App

Enter Burak Bahar, a like-minded Clinical Pathologist who is currently doing a fellowship at Yale. He liked my cp-R program but he saw the need for a web-based equivalent.

Burak and his wife Ayse, also a physician, have coded a Shiny App for doing Deming, PB and least squares regression in R which is capable of producing publication-quality figures and provides all the regression statistics you would need for method validation or publication. It can also produce a regression report in PDF, Word or HTML. The dynamic duo of the Bahar MDs deserve all the credit here, as my only contribution was suggestions about usability. This project was presented at the 2016 American Association for Clinical Chemistry meeting in Philadelphia.

The app URL is bahar.shinyapps.io/method_compare. Go to the Data tab on the left and then cut and paste your data from a spreadsheet program. The shortcuts CTRL-C (copy) and CTRL-V (paste) work natively in the table, which is pre-populated with some random data for demonstration purposes. Once your data is pasted in, click on the Plots tab and choose the Bland-Altman or Scatter Plot.

Example

Here is an image generated with the Bahar Shiny app using method comparison data obtained at St. Paul's Hospital Laboratory in migrating from the Siemens Immulite 2000 XPi to the Roche Cobas e601 for calcitonin determination. Don't worry, we did more than 33 comparisons; I am just showing the low end.

[Figure: regression plot from the Shiny app]

[Figure: Bland-Altman plot from the Shiny app]

Try adjusting some of the plot parameters. The figures will update in real time. Thanks to Burak and Ayse Bahar for your work!

(Dan's) Parting Thought

There are straight lines that matter a lot more than regression.

I will make justice the measuring line and righteousness the plumb line
(Isa 28:17)

Deming and Passing Bablok Regression in R

Regression Methods

In this post we will be discussing how to perform Passing Bablok and Deming regression in R. Those who work in Clinical Chemistry know that these two approaches are required by the journals in the field. The idiosyncratic affection for these two forms of regression appears to be historical, but this is unlikely to change in my lifetime; hence the need to cover it here.

Along the way, we shall touch on the ways in which Deming and Passing Bablok differ from ordinary least squares (OLS) and from one another.

Creating some random data

Let's start by making some heteroscedastic random data that we can use for regression. We will begin with set.seed() so that the reader can generate the same random data as in this post. This function takes any number you wish as its argument, but if you set the same seed, you will get the same random numbers. We will generate 100 random \(x\) values from the uniform distribution and then an accompanying 100 random \(y\) values with proportional bias, constant bias and random noise that increases with \(x\). I have added a bit of non-linearity because we do see this a fair bit in our work.

The constants I chose are arbitrary. I chose them to produce something resembling a comparison of, say, two automated immunoassays.
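A sketch of such data; the constants here are illustrative stand-ins, not the original values:

    set.seed(10)
    x <- runif(100, 0, 100)                          # 100 uniform x values
    y <- 1.1 * x + 2 +                               # proportional and constant bias
         0.001 * x^2 +                               # a bit of non-linearity
         rnorm(100, sd = 0.02 * x + 0.5)             # noise that grows with x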

Let's quickly produce a scatter plot to see what our data looks like:
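For example:

    plot(x, y)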

[Figure: scatter plot of the simulated data]

Residuals in OLS

OLS regression minimizes the sum of squared residuals, where the residual of a point is defined as its vertical distance from the regression line. The regression line is chosen so that this sum is minimal.

OLS regression assumes that there is no error in the \(x\)-axis values and that there is no heteroscedasticity, that is, that the scatter of \(y\) is constant. Neither of these assumptions is true in the case of bioanalytical method comparisons. In contrast, for calibration curves in mass spectrometry, a linear response is plotted as a function of pre-defined calibrator concentration. This means that the \(x\)-axis has very little error, and so OLS regression is an appropriate choice (though I doubt that the assumption about homoscedasticity is generally met).

OLS is part of base R. We can find the OLS regression line using lm(), and we will store the results in the variable lin.reg.
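For example:

    lin.reg <- lm(y ~ x)           # ordinary least squares
    plot(x, y)
    abline(lin.reg, col = "blue")  # overlay the OLS line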

[Figure: the data with the OLS regression line]

Just to demonstrate the point about residuals graphically, the following shows them in vertical red lines.
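A sketch of how such a figure is drawn:

    plot(x, y)
    abline(lin.reg, col = "blue")
    segments(x, y, x, fitted(lin.reg), col = "red")   # vertical residuals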

[Figure: vertical OLS residuals shown in red]

Deming Regression

Deming regression differs from OLS regression in that it does not make the assumption that the \(x\) values are free of error. It (more or less) defines the residual as the perpendicular distance from a point to its fitted value on the regression line.

Deming regression does not come as part of R's base package but can be performed using the MethComp and mcr packages. In this case, we will use the latter. If not already installed, you must install the mcr package with install.packages("mcr").

Then to perform Deming regression, we will load the mcr library and execute the following using the mcreg() command, storing the output in the variable dem.reg.
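For example:

    library(mcr)
    dem.reg <- mcreg(x, y, method.reg = "Deming")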

By performing the str() command on dem.reg, we can see that the regression parameters are stored in the slot @para. Because the authors have used an S4 object as the output of their function, we don't address output as we would in lists (with a $), but rather with an @.

The intercept and slope are stored in dem.reg@para[1] and dem.reg@para[2] respectively. Therefore, we can add the regression line as follows:
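For example:

    plot(x, y)
    abline(dem.reg@para[1], dem.reg@para[2], col = "red")   # Deming line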

[Figure: the data with the Deming regression line]

To emphasize how the residuals are different from OLS we can plot them as before:
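A sketch: the foot of the perpendicular from each point to the line \(y = a + bx\) is found analytically and joined to the point:

    a <- dem.reg@para[1]
    b <- dem.reg@para[2]
    xf <- (x + b * (y - a)) / (1 + b^2)   # foot of the perpendicular
    yf <- a + b * xf
    plot(x, y)
    abline(a, b, col = "red")
    segments(x, y, xf, yf, col = "red")   # perpendicular residuals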

[Figure: perpendicular Deming residuals shown in red]

We present the figure above for instructional purposes only. The usual way to present a residuals plot is to show the same picture rotated until the line is horizontal; this is a slight simplification, but it is essentially what is happening:

[Figure: the residuals plot rotated so the regression line is horizontal]

Ratio of Variances

It is important to mention that if one knows that the \(x\)–axis method is subject to a different amount of random analytical variability than the \(y\)–axis method, one should provide the ratio of the variances of the two methods to mcreg(). In general, this requires us to have “CV” data from precision studies already available. Another approach is to perform every analysis in duplicate by both methods and use the data to estimate this ratio.

If the methods happen to have similar CVs throughout the analytical range, the default value of 1 is assumed. But suppose the ratio of the CVs of the \(x\)-axis method to the \(y\)-axis method were 1.2; we could provide this in the regression call by setting the error.ratio parameter. The resulting regression parameters will be slightly different.
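A sketch of such a call (note that mcr's documentation describes error.ratio as a ratio of the squared measurement errors, i.e. variances, so check help("mcreg") before plugging in a CV ratio):

    dem.reg <- mcreg(x, y, error.ratio = 1.2, method.reg = "Deming")
    dem.reg@para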

Weighting

In the case of heteroscedastic data, it would be customary to weight the regression, which in the case of the mcr package is weighted as \(1/x^2\). This means that having 0's in your \(x\)-data will cause the calculation to “crump”. In any case, if we wanted weighted regression parameters we would make the call:
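A minimal sketch:

    wdem.reg <- mcreg(x, y, method.reg = "WDeming")   # weighted Deming
    wdem.reg@para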

And plotting both on the same figure:

[Figure: weighted and unweighted Deming regression lines on the same plot]

Passing Bablok

Passing Bablok regression is not performed by the minimization of residuals. Rather, all possible pairs of \(x\)–\(y\) points are determined and slopes are calculated from each pair of points. Workarounds are undertaken for pairs of points that generate infinite slopes and other peculiarities. In any case, the median of the \(\frac{N(N-1)}{2}\) possible slopes becomes the final slope estimate, and the corresponding intercept can be calculated. With regards to weighted Passing Bablok regression, I’d like to acknowledge commenter glen_b for bringing to my attention that there is a paradigm for calculating the weighted median of pairwise slopes. See the comment section for a discussion.

Passing Bablok regression takes a lot of computational time as the number of points grows, so expect some delays on data sets larger than \(N=100\) if you are using an ordinary computer. To get the Passing Bablok regression equation, we just change the method.reg parameter:
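For example:

    pb.reg <- mcreg(x, y, method.reg = "PaBa")    # Passing Bablok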

and the procedures to plot this regression are identical. The mcreg() function does have an option for Passing Bablok regression on large data sets. See the instructions by typing help("mcreg") in the R terminal.

Outlier Effects

As a consequence of the means by which the slope is determined, the Passing Bablok method is relatively resistant to the effect of outliers as compared to OLS and Deming. To demonstrate this, we can add an outlier to some data scattered about the line \(y=x\) and show how all three methods are affected.
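A sketch of such a demonstration (illustrative data; the slopes quoted below come from the original post's data):

    set.seed(1)
    xo <- runif(30, 0, 10)
    yo <- xo + rnorm(30, sd = 0.2)                    # data scattered about y = x
    xo <- c(xo, 9); yo <- c(yo, 3)                    # a single gross outlier
    coef(lm(yo ~ xo))[2]                              # OLS slope
    mcreg(xo, yo, method.reg = "Deming")@para[2]      # Deming slope
    mcreg(xo, yo, method.reg = "PaBa")@para[2]        # Passing Bablok slope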

[Figure: OLS, Deming and Passing Bablok fits in the presence of an outlier]

Because of this outlier, the OLS slope drops to 0.84, the Deming slope to 0.91, while the Passing Bablok is much better off at 0.99.

Generating a Pretty Plot

The authors of the mcr package have created a feature such that if you put the regression model inside the plot() function, you can quickly generate a figure that has all the required information on it. For example,
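For example:

    plot(dem.reg)    # mcr's plot method for MCResult objects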

[Figure: default mcr plot of the regression model]

But this out-of-the-box figure is not very customizable, and you may want it to appear differently for your publication. Never fear. There is a solution. The MCResult.plot() function offers complete customization of the figure so that you can show it exactly as you wish for your publication.

[Figure: customized mcr regression plot]

In this example, I have created semi-transparent “darkorchid4” (hex = #68228B) points and a semi-transparent blue (hex = #0000FF) confidence band for the regression. Maybe darkorchid would not be my first choice for a publication after all, but it demonstrates the customization. Additionally, I have suppressed my least favourite features of the default plot method. Specifically, the sub="" term removes the sentence at the bottom margin and add.grid = FALSE prevents the grid from being plotted. Enter help(MCResult.plot) for the complete low-down on customization.
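A sketch of such a call; apart from sub and add.grid (named above), the argument names are assumptions based on mcr's documentation and should be checked against help(MCResult.plot):

    MCResult.plot(dem.reg,
                  points.col = "#68228B80",   # semi-transparent darkorchid4 points
                  ci.area = TRUE,
                  ci.area.col = "#0000FF50",  # semi-transparent blue confidence band
                  sub = "",                   # drop the bottom-margin sentence
                  add.grid = FALSE)           # no background grid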

Conclusion

We have seen how to perform Deming and Passing Bablok regression in the R programming language and have touched on how the methods differ “under the hood”. We have used the mcr package to perform the regressions and have shown how you can beautify your plot.

The reader should have a look at the rlm() function in the MASS package and the rq() function in the quantreg package to see other robust (outlier-resistant) regression approaches. A good tutorial can be found here.

I hope that makes it easy for you.

-Dan

May all your paths (and regressions) be straight:

Trust in the Lord with all your heart
and lean not on your own understanding;
in all your ways submit to him,
and he will make your paths straight.

Proverbs 3:5-6