Tag Archives: clinical pathology

Parse an Online Table into an R Dataframe – Westgard’s Biological Variation Database


From time to time I have wanted to bring an online table into an R dataframe. While in principle, the data can be cut and paste into Excel, sometimes the table is very large and sometimes the columns get goofed up in the process. Fortunately, there are a number of R tools for accomplishing this. I am just going to show one approach using the rvest package. The rvest package also makes it possible to interact with forms on webpages to request specific material which can then be scraped. I think you will see the potential if you look here.

In our (simple) case, we will apply this process to Westgard's desirable assay specifications as shown on his website. The goal is to parse out the biological variation tables, get them into a dataframe and the write to csv or xlsx.

Reading in the Data

The first thing to do is to load the rvest and httr packages and define an html session with the html_session() function.

Now looking at the webpage, you can see that there are 8 columns in the tables of interest. So, we will define an empty dataframe with 8 columns.

We need to know which part of the document to scrape. This is a little obscure, but following the instructions in this post, we can determine that the xpaths we need are:





There are 8 such tables in the whole webpage. We can define a character vector for these as such:

Now we make a loop to scrape the 8 tables and with each iteration of the loop, append the scraped subtable to the main dataframe called biotable using the rbind() function. We have to use the parameter fill = TRUE in the html_table() function because the table does not happen to always a uniform number of columns.

Clean Up

Now that we have the raw data out, we can have a quick look at it:

X1 X2 X3 X4 X5 X6 X7 X8
Analyte Number of Papers Biological Variation Biological Variation Desirable specification Desirable specification Desirable specification
Analyte Number of Papers CVI CVg I(%) B(%) TE(%)
S- 11-Desoxycortisol 2 21.3 31.5 10.7 9.5 27.1
S- 17-Hydroxyprogesterone 2 19.6 50.4 9.8 13.5 29.7
U- 4-hydroxy-3-methoximandelate (VMA) 1 22.2 47.0 11.1 13.0 31.3
S- 5' Nucleotidase 2 23.2 19.9 11.6 7.6 26.8
U- 5'-Hydroxyindolacetate, concentration 1 20.3 33.2 10.2 9.7 26.5
S- α1-Acid Glycoprotein 3 11.3 24.9 5.7 6.8 16.2
S- α1-Antichymotrypsin 1 13.5 18.3 6.8 5.7 16.8
S- α1-Antitrypsin 3 5.9 16.3 3.0 4.3 9.2

We can see that we need define column names and we need to get rid of some rows containing extraneous column header information. There are actually 8 such sets of headers to remove.

Let's now find rows we don't want and remove them.

You will find that the table has missing data which is written as “- – -”. This should be now replaced by NA and the column names should be assigned to sequential integers. Also, we will remove all the minus signs after the specimen type. I'm not sure what they add.

Check it Out

Just having another look at the first 10 rows:

Sample Analyte NumPapers CVI CVG I B TE
S 11-Desoxycortisol 2 21.3 31.5 10.7 9.5 27.1
S 17-Hydroxyprogesterone 2 19.6 50.4 9.8 13.5 29.7
U 4-hydroxy-3-methoximandelate (VMA) 1 22.2 47.0 11.1 13.0 31.3
S 5' Nucleotidase 2 23.2 19.9 11.6 7.6 26.8
U 5'-Hydroxyindolacetate, concentration 1 20.3 33.2 10.2 9.7 26.5
S α1-Acid Glycoprotein 3 11.3 24.9 5.7 6.8 16.2
S α1-Antichymotrypsin 1 13.5 18.3 6.8 5.7 16.8
S α1-Antitrypsin 3 5.9 16.3 3.0 4.3 9.2
S α1-Globulins 2 11.4 22.6 5.7 6.3 15.7
U α1-Microglobulin, concentration, first morning 1 33.0 58.0 16.5 16.7 43.9

Now examining the structure:

It's kind-of undesirable to have numbers as characters so…

Write the Data

Using the xlsx package, you can output the table to an Excel file in the current working directory.

If you are having trouble getting xlsx to install, then just write as csv.


You can now use the same general approach to parse any table you have web access to, no mater how small or big it is. Here is a complete script in one place:

Parting Thought on Tables

You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows.

(Psalm 23:5)

Please follow and like us:

Determine the CV of a Calculated Lab Reportable – Bioavailable Testosterone


At the AACC meeting last week, some of my friends were bugging me that I had not made a blog post in 10 months. Without getting into it too much, let's just say I can blame Cerner. Thanks also to a prod from a friend, here is an approach to a fairly common problem.

We all report calculated quantities out of our laboratories–quantities such as LDL cholesterol, non-HDL cholesterol, aldosterone:renin ratio, free testosterone, eGFR etc. How does one determine the precision (i.e. imprecision) of a calculated quantity. While earlier in my life, I might go to the trouble of trying to do such calculations analytically using the rules of error propagation, in my later years, I am more pragmatic and I'm happy to use a computational approach.

In this example, we will model the precision in calculated bioavailable testosterone (CBAT). Without explanation, I provide an R function for CBAT (and free testosterone) where testosterone is reported in nmol/L, sex hormone binding globulin (SHBG) is reported in nmol/L, and albumin is reported in g/L. Using the Vermeulen Equation as discussed in this publication, you can calculate CBAT as follows:

To sanity-check this, we can use this online calculator. Taking a typical male testosterone of 20 nmol/L, an SHBG of 50 nmol/L and an albumin of 43 g/L, we get the following:

which is confirmed by the online calculator. Because the function is vectorized, we an submit a vector of testosterone results and SHBG results and get a vector of CBAT results.

Precision of Components

We now need some precision data for the three components. However, in our lab, we just substitute 43 g/L for the albumin, so we will leave that term out of the analysis and limit our precision calculation to testosterone and SHBG. This will allow us to present the precision as surface plots as a function of total testosterone and SHBG.

We do testosterone by LC-MS/MS using Deborah French's method. In the last three months, the precision has been 3.9% at 0.78 nmol/L, 5.5% at 6.7 nmol/L, 5.2% at 18.0 nmol/L, and 6.0% at 28.2 nmol/L. We are using the Roche Cobas e601 SHBG method which, according to the package insert, has precision of 1.8% at 14.9 nmol/L, 2.1 % at 45.7 nmol/L, and 4.0% at 219 nmol/L.

plot of chunk unnamed-chunk-4

plot of chunk unnamed-chunk-4

Build Approximation Functions

We will want to generate linear interpolations of these precision profiles. Generally, we might watnt to use non-linear regression to do this but I will just linearly interpolate with the approxfun() function. This will allow us to just call a function to get the approximate CV at concentrations other than those for which we have data.

Now, if we want to know the precision of SHBG at, say, 100 nmol/L, we can just write,

to obtain our precision result.

Random Simulation

Now let's build a grid of SHBG and total testosterone (TT) values at which we will calculate the precision for CBAT.

At each point on the grid, we will have to generate, say, 100000 random TT values and 100000 random SHBG values with the appropriate precision and then calculate the expected precision of CBAT at those concentrations.

Let's do this for a single pair of concentrations by way of example modelling the random analytical error as Gaussian using the rnorm() function.

So, we can build the process of calculating the CV of CBAT into a function as follows:

Now, we can make a matrix of the data for presenting a plot, calculating the CV and appending it to the dataframe.

Now make plot using the wireframe() function.

plot of chunk unnamed-chunk-11

This shows us that the CV of CBAT ranges from about 4–8% over the TT and SHBG ranges we have looked at.


We have determined the CV of calculated bioavailable testosterone using random number simulations using empirical CV data and produced a surface plot of CV. This allows us to comment on the CV of this lab reportable as a function of the two variables by which it is determined.

Parting Thought on Monte Carlo Simulations

The die is cast into the lap, but its every decision is from the LORD.

(Prov 16:33)

Please follow and like us:

Conditional Formatting of a Table in R


There are a few ways to approach the problem of a conditionally formatted table in R. You can use the ReporteRs package's FlexTable() function, the formattable package, or the condformat package. These allow you to produce a conditionally formatted tables in HTML. You can also use xtable package and essentially program what you want in LaTeX via the xtable() function.

In my desire for something simple-ish, I am going do this graphically using the image() function as suggested here. The benefit is that I can then push the table into an RMarkdown generated PDF document easily.

The Problem

Suppose that you want to prepare a summary of how resident and medical student orders are placed on various wards. You obtain data that is formatted in the following manner.

There are 4 wards: medicine, surgery, ER and orthopedics. Orders can come in as computerized physician order entry (CPOE), verbal or written. The orders have to be cosigned by staff and this is recorded as TRUE/FALSE because staff are not always compliant in logging on to the EMR to cosign the trainee orders.

Preparing Proportions Table

Let's start with the assumption that we want to apply the same conditional formatting to all data in the table. That is, we want to color code all results with the same algorithm. We can used the image() function to get this done. Let's display the rates at which different order types (CPOE, verbal,or written) from the four wards. We can generate the proportions table in percent very easily with the prop.table() and table() functions operating on the first two columns of our orders data:

A DIY Approach with the Image Function

The image() function produces a tile plot based on matrix of z values, where z = f(x,y) using colours we can define and thresholds for switching from one colour to the next based on a breaks parameter. In our case, we will say that if the result is less than equal to 25%, we will colour the tile blue, if it is greater than 25% but less than or equal to 50%, we will colour it red, and if it greater than 50%, it will be yellow.

You will note that we have to transpose the data with the t() function because the image function plots the rows on the x axis on the columns on the y axis. You will also notice that we need to plot y descending on the y-axis to account for the fact that our tabular data has increasing index going down but the tile plot will default to have increasing y going up. We can also need to suppress the axes and their labels. The reader can comment out the lines xaxt = 'n' and yaxt = 'n' to see what is going on in terms of x and y values.

plot of chunk unnamed-chunk-5

Now we can write our values over top with the text() function.

plot of chunk unnamed-chunk-7

And then we can write the variable names (which we yank from the attributes of the table) into the figure margin and draw some lines to make it look pretty. It was necessary to use the adj and padj parameters to make it look a little cleaner.

plot of chunk unnamed-chunk-9

Conditionally Coloured Text

Now, if you want to make the text colour match the background colour, we will need a little function.

and then apply it over the values of the matrix:

plot of chunk unnamed-chunk-12

Different Conditions for Different Columns

Now suppose you wanted different conditional formatting for each column. This is kind of a pain because you will need to provide the image() function a matrix to generate an appropriate fill-colour and a different matrix for the data to be written in each cell. Let's imagine for example that we want to include the compliance rate for co-signing in a fourth column and this is the only column we want coloured. To this column we want a colour scheme applied wherein if compliance is less than or equal to 20%, the colour is red, between 20% and 80%, it is yellow, and above 80% it is green.

We can calculate a proportions table based on columns 1 and 3 of the orders dataframe and then we can define a matrix fill.data that has NA on all the rates we calculated above.

Now the proportions matrix is as follows:

and the fill data is:

Now we can apply the image() function to the fill.data matrix. When it comes to writing the data in the cells, we will use the original my.data matrix and we will adjust out color.picker() function.

plot of chunk unnamed-chunk-16

So, it looks like this could become super–awkward if we had elaborate conditions to apply. This is where a packages like condformat and formattable come in handy. If you use the condformat package, you can include the table in an RMarkdown generated PDF or HTML document. However, the formattable() function, though capable of much prettier output, does not work with PDFs generated using RMarkdown.

First, here is a condformat example. Suppose we wanted to colourized CPOE in shades of green because CPOE is more operationally desirable and verbal/written orders in shades of red because they are less operationally desirable. We also want the red/yellow/green formatting in the Cosigned column. Using condformat we could do the following:

CPOE Verbal Written Cosigned
1 49.3 8.7 42.0 75.3
2 30.0 4.0 66.0 52.0
3 89.5 7.0 3.5 88.5
4 8.0 23.0 69.0 13.0

You can see that the rownames are suppressed with condformat(). You could circumvent this by putting the rownames into their own column. This package is pretty easy to use and with PDF rendering (shown below) it produces something more LaTeX-ish than what is shown above which was generated straight to HTML.

plot of chunk unnamed-chunk-18

For something more attractive looking, here is an example of something similar using the formattable package (borrowing heavily from the code author's examples ):

CPOE Verbal Written Cosigned
Med 49.3 8.7 42.0 75.30 (rank: 02)
Surg 30.0 4.0 66.0 52.00 (rank: 03)
ER 89.5 7.0 3.5 88.50 (rank: 01)
Orth 8.0 23.0 69.0 13.00 (rank: 04)

I hope that this points you in the right direction.

And as for conditions:

“If you declare with your mouth, “Jesus is Lord,” and believe in your heart that God raised him from the dead, you will be saved.”

Romans 10:9

Please follow and like us:

Make Easy Heatmaps to Visualize your Turnaround Times

The Problem

In two previous posts, I discussed visualizing your turnaround times (TATs). These posts are here and here. One other nice way to visualize your TAT is by means of a heatmap. In particular, we would like to look at the TAT for every hour of the week in a single figure. This manner of dataviz bling seems to be particularly attractive to managers because it costs you $0 to do this with R, but with commercial tools like Tableau, you'd have to pay a fortune and, as with Excel, your report would not be readily reproducible. Further, to make it autogenerate a PDF would mean you had to fork out more money for a report-generation module. Pffft.

The Data

We're going to read in a year's worth of order times and result times for a stat immunoassay test offered to a particular ward. The data, as I've formatted it, has two columns, ord and res.

Now, of course, we want to look at data collected from a long period of time so that we can be sure that the observations we are not simply an artifact of recent instrument downtime, maintenance, or whoever happened to be running the instrument. This is why I chose a year's worth of data. We are going to visualize the median order-to-file TAT for this test.

Formatting and Calculations

To calculate the hourly medians, we'll need to be able to label every TAT with the day it was run and the hour in the day it was run. This is pretty easy with the lubridate package. We'll do three things:

  • We'll convert the dates to POSIXct objects
  • We'll use the difftime() function to calculate the TATs
  • We'll use the wday() function to determine which day of the week the specimen was run on
  • We'll pull out the hour of the day on which it was run with the format() function.

And now the data will look like this:

where the order-to-file TAT is in the otf column, the day-of-week is in the dow column and the hour-of-day is in the hod column. Now we can cycle though the days of the week and the hours of the day and calculate the year's median TAT for each hour, storing it in a matrix:

Making the Heatmap

There are many ways to make the heatmap but I am particularly fond of the appearance of surface plots made with the fields package.

plot of chunk unnamed-chunk-5

Overlay Printed Times

We can see that there is a morning slowdown that is particularly bad on Saturday. But what if we wanted to know the exact value for these eye-catching problem times? We'd have trouble, unless we overlaid some text.

It turns out that if you use white printing, you can't read the numbers when the background colour is yellow and green. There is a 64 colour gradient used in the image.plot() function, so I calculated which integers in 0–64 were the problem and found the TATs that would correspond. It turned out that colours 20–45 out of the 64 colours in the gradient are the problem. By this means, I can make the printing black over the yellows and greens but white everywhere else:

plot of chunk unnamed-chunk-6

So, that is not too bad, and if you wanted to look at the 75th percentile instead you would only have to adjust the heat.data calculation as follows:

And this is what you will get.

plot of chunk unnamed-chunk-8

Hmmm…we'd better look at Saturday morning, 6 am. I hope you have found this helpful.

And as for heat

“He will sit as a refiner and purifier of silver”

Malachi 3:3

Please follow and like us:

Make Bland Altman Plots with Marginal Histograms using ggExtra

The Problem

As you know in Clinical Chemistry, we are not always writing a major paper but sometimes just preparing a short-report to answer a technical question that we've encounted at work. For shorter papers, journals often have more stringent rules about how many figures you can submit and even sometimes forbid multipanelled figures. In these situations, we might want to cram a little more into your figure than we might otherwise. In a recent submission, I wanted to produce a difference plot of immunoassay results before and after storage but I also wanted to show the distribution of the results using a histogram–but this would have counted as two separate figures.

However, thanks to some fine work by Dean Attali of UBC Department of Statistics where he works with R-legend Jenny Bryan, it is quite easy to add marginal histograms to a Bland Altman (or any other scatter) plot using the ggExtra package.

How To

Let's make some fake data for a Bland Altman plot. Let's pretend that we are measuring the same quantity by immunoassay at baseline and after 1 year of storage at -80 degrees. We'll add some heteroscedastic error and create some apparent degradation of about 20%:

plot of chunk unnamed-chunk-1

Or if we plot this in the ggplot() paradigm

plot of chunk unnamed-chunk-2

Now we will prepare the difference data:

In standard Bland Altman plots, one plots the difference between methods against the average of the methods, but in this case, the x-axis should be the baseline result, because that is the closest thing we have to the truth.

plot of chunk unnamed-chunk-4

So that is the difference plot for the absolute difference. We can also obviously do the percent difference.

plot of chunk unnamed-chunk-5

Kickin' it Old School

You can also do this in a non-ggplot() paradigm using base plotting utilities as described in this R-bloggers post.


And that, friends, is a way of squishing in a histogram of your sample concentrations into your difference plot which allows you to graphically display your sampling distribution and justify whether you would use parametric or non-parametric statistics to assess the extent of loss of immunoreactivity from storage.

And speaking of scatterplots

“…then the Lord your God will restore your fortunes and have compassion on you and gather you again from all the nations where he scattered you.”
Deut 30:3

Please follow and like us:

A Shiny App for Passing Bablok and Deming Regression


Back in 2011 I was not aware of any tool in R for Passing Bablok (PB) regression, a form of robust regression described in a series of three papers in Clinical Chemistry and Laboratory Medicine (then J Clin Chem and Biochem) available here, here and here. For reasons that are not entirely clear to me, this regression methodology is favoured by clinical chemists but seems largely ignored by other disciplines. However since reviewers clinical chemistry journals will demand the use of PB regression, it seemed expeditious to me to code it in R. This is what spawned a small project for a piece of software to do PB (and Deming and ordinary least squares) regression using a self-contained executable that could be downloaded, unzipped on a Windows Desktop and just ran. You can download here and instructions for installation and use are here and here respectively. The calculations are all done in R, the GUI is built with Python and Py-Qt4 and the executable with cx_freeze. I made it run without an installer because hospital IT often refuse to install software that has not been officially vetted and purchased. The tool was a lot more popular than I anticipated now having about 2000 downloads. In any case, maintenance, upgrades, bug fixing and dealing with operating system updates that break things (like OSX El Capitan's security policies) are no-fun so a Shiny based solution to the same problem makes a lot of sense.


Since 2011, statisticians at Roche Diagnostics programmed the mcr package for PB and Deming regression. Additionally, there is also the MethComp package and the deming package from the Mayo Clinic which both offer PB regression.

Shiny App

Enter Burak Bahar, a like-minded Clinical Pathologist who is currently doing a fellowship at Yale. He liked my cp-R program but he saw the need for a web-based equivalent.

Burak and his wife Ayse, also a physician, have coded a Shiny App for doing Deming, PB and least squares regression in R which is capable of producing publication quality figures and provides all the regression statistics you would need for method-validation or publication. It can also produce a regression report in PDF, Word or HTML. The dynamic duo of the Bahar-MDs deserve all credit here as my only contribution related to suggestions related to usability. This project was presented at the 2016 American Association of Clinical Chemistry meeting in Philadelphia.

The app URL is bahar.shinyapps.io/method_compare. Go to the data tab on the left and then cut and paste your data from an spreadsheet program. Shortcuts CTRL-C (copy) and CTRL-V (paste) work natively in the table. The table is pre-populated with some random data for demonstration purposes. Once your data is pasted in, click on the Plots tab and choose the Bland-Altman or Scatter Plot.


Here is an image generated with the Bahar Shiny app using method comparison data obtained from St. Paul's Hospital Laboratory in migrating from Siemens Immulite 2000 XPi to Roche Cobas e601 for Calcitonin determination. Don't worry, we did more than 33 comparison–I am just showing the low end.



Try adjusting some of the plot parameters. The figures will update in real time. Thanks to Burak and Ayse Bahar for your work!

(Dan's) Parting Thought

There are straight lines that matter a lot more than regression.

I will make justice the measuring line and righteousness the plumb line
(Isa 28:17)

Please follow and like us:

Flat File Interface your Mass Spectrometer to the Laboratory Information System with R

The Problem

As Clinical Pathologists we work hard to create laboratory developed tests (LDTs) using liquid chromatography and tandem mass spectrometry (LC-MS/MS) that are robust, repeatable, accurate and have a wider dynamic range than commercial immunoassays. In our experience, properly developed LC-MS/MS assays are much less expensive and outperform their commercial immunoassay counterparts from an analytical standpoint.

However, despite mass spectrometry's communal obsession with analytical performance of our LDTs, sometimes we overlook the matter of handling the data we generate. Unlike traditional diagnostic companies (e.g. Siemens, Roche) who take care of upload and download of patient data and results via HL7 streams to the laboratory information system (LIS), mass spectrometry companies have not yet made this a priority. This leaves us either paying out a lot of money for custom middleware solutions or manually transcribing our LC-MS/MS results.

We might naively think, “How bad can the transcription be?” but over time, it becomes painfully evident that manual transcription of result is tedious, error–prone and inefficient use of tech–time.

Many LIS vendors offer what is called a “flat-file interface”. In this case, there is no HL7 stream generated using a communication socket between instrument and LIS. Rather, the results are saved in an ASCII text file with a pre-defined format and then transferred to the LIS via a secure shell (SSH) connection.

For this post, we are going to take some sample flat files from a SCIEX API5000 triple quadrupole mass spectrometer and prepare a flat file for the SunQuest LIS. Please note that this code is provided to you as is under the GNU Public Licence and without any guarantee. You know how all the LC-MS/MS vendors say their instruments are for “research use only”? –yeah, I'm giving this to you in the same spirit. If you use or modify it, you do so at your own risk. Any changes to how your flatfile is generated by your mass spectrometer or any upgrades to your LC-MS/MS software could make this code malfunction. You have been warned.

The Required Format

SunQuest requires the output file to be a comma separated values (CSV) file with a unique specimen or internal QC result in each row. The first column is the instrument ID, the second columns is the specimen container ID (an E followed by a 10–digit integer), the third is testcode and the fourth is the result. The file itself is required to have a time–stamp so that it has a traceable name and should have no header. For an instrument named PAPI (short for Providence API 5000) and a testcode TES (for testosterone), the file might look like this:

The Starting Material

After we have completed an analytical run and reviewed all peaks to generate our fileable results, we can export the quatified sample batch to an ASCII text file. The file contains a whole lot of diagnostic information about the run like which multiple reaction monitoring (MRM) transitions we used, what the internal standard (IS) counts were, results from the quantifier and qualifier ion, fitted values for the calibrators etc. There are more than 80 columns in a typical file and we could talk about all the things we might do with this data but in this case, we are concerned with extracting and preparing the results file.

Dialogue Box

If we are actually going to make an R script usable by a human, it would be good to be able to choose which file we want to process and what test we want to extract using a simple graphical user interface (GUI). There are a number of tools one can use to build GUIs in R but the most rudimentary is TclTk. I have to confess that I find the language constructs for GUI creation both non–intuitive and boring. For this reason, I present without discussion, a modification of a recipe for creating a box with radio–buttons. We are going to choose which of three analytes (you can increase this number as you please) for which we wish to process a flat–file. These are: aldosterone, cortisol and testosterone. Please note that if you execute this code on a Mac, you will have to install XQuartz because Macs don't have native X-windows support despite the BSD Linux heritage of OSX.

This will give us the following pop-up window with radiobuttons in which I have selected testosterone.


You will notice that Tk windows do not appear native to the operating system. We can live with this because we are not shallow.

After you hit the OK button, the Tk widget then puts the chosen value into an Tk variable called rbValue. We can determine the value using the command tclvalue(rbValue). The reason we need to know which analyte we are working with is because the name of the MRM we want to pull out of the flat file is dependent on the analyte of course. We will also need to replace results below the limit of quantitation (LoQ) with “< x”, whatever x happens to be, which will be a different threshold for each analyte.

In our case, the testcodes for aldosterone, cortisol and testosterone are ALD,CORT and TES respectively, the LoQs are 50 pmol/L, 1 nmol/L and 0.05 nmol/L and the MRM names are “Aldo 1”, “Aldo 2”, “Cortisol 1”, “Cortisol 2” and “Testo 1” and “Testo 2” as we defined them within SCIEX Analyst Software. We will use the switch() function to define three variables (test.code, LoQ, and MRM.names) which we will use later to process the flat–file. We will also define the name of the worksheet in a variable called worksheet. These are the parameters you would have to change in order to modify the code for your purposes.

Building File Names

Now we will prompt the user to tell them that they are to choose an instrument flat–file and we will determine the path of the chosen file. We will need the path to both read in the appropriate file but also to write the output later.

This code will create this message box:


and this file choice dialogue box:


and after a file is selected and the Open is pressed, the path to the flat–file is stored in the variable flat.file.path.

Behold: The Data

So we chosen the file we want to read in but what does this file look like? To just get a gander at it, we could open it with Excel and see how it is laid out. But since we have broken up with Excel, we won't do this. SCIEX Analyst exports tab (not comma) delimited files. R has a built in function read.delim() for reading these files but we will quickly discover that read.delim() assumes the files have a rectangular structure, having the same number of columns in each row. R will make assumptions about the shape of the data file based on the first few rows and then try to read it in. In this case, it will fail and you will get gibberish. To get this to work for us we will need to tell R how many rows to skip before the real data starts or we will need to tell R the number of columns the file has (which is not guaranteed to be consistent between versions of vendor software). There are lots of ways to do this but I think the simplest is to use grep().

I did this by reading the file in with no parsing of the tabs using the readLines() function. This function creates a vector for which each successive value is the entire content of the row of the file. I display the first 30 lines of the file. Suppose that we chose a testosterone flat file.

All of the \t's that you see are the tabs in the file which are has read in literally when we use readLines(). We can see that in this file nothing of use happens until line 29 but this is not consistent from file to file so we should not just assume that 29 is always the magic number where the good stuff begins. We can see that the line starting “Sample Name \t Sample ID” is the real starting point so we can determine how many lines to skip by using grep() and prepare for some error–handling with a variable called problem by which we can deal with the circumstance that no approriate starting row is identified.

Now that we know how many lines to skip we can read in the data:

We can have a look at the structure of this file

Just Tell Me the Results

And we see that there is lots of stuff we don't need. What we do need are the columns titled “Sample.Name” (which is the specimen container ID in this case), the “Analyte.Peak.Name” (which is the MRM, either quantifier or qualifier), and the one whose name starts with “Calculated.Concentration..”. The last of these also contains the units of measure which is analyte–dependent. To get rid of this analyte–dependence of the column name, we can find out which column this is and rename it:

Now we can pull out the three columns of interest and put them into a dataframe named results.

Now we only need the quantifier ion results which we were defined by the user with Tk GUI, so we can pull them out with grep. I will pull out the qualifiers also but we do not need them unless we wanted to compute ion-ratios, for example.

Having pulled out the MRM of interest, we can define which rows correspond to standards, QC and patients by appropriate use of grep(). It happens that the CIDs all start with E followed by a 10 digit number so we can search for this pattern with a simple regular expression. Since we only need the QCs and patient data, the variable standards is calculated only as a matter of completeness.

Preparing Data for Output

Now we can prepare to write a dataframe corresponding to the required format of the output file. To do so, we'll need to find out how many rows we are writing and then prepare a vector of the same length repeating the name of the worksheet and testcode:

Now we can replace all the NA values that replaced “No Peak” with the correct LoQ according to which analyte we are looking at.

Our final.output.data dataframe looks like it behaved properly.

Timestamping, Writing and Archiving

And finally, we create directories to archive our data (if those directories do not exist) and write the files with an appropriate timestamp determined using Sys.time(). Since colons (i.e : ) don't play nice in all operating systems as filenames, we can use gsub() to get rid of them. We also pass along error messages or confirmation messages to the user as appropriate.