A Deep Learning Classifier of New Testament Verse Authorship using the R Keras Package

Introduction

This is the first of what I hope will be a number of posts on different machine learning classifiers. The subject matter is not lab medicine, but the methodology applies to any similar project. For example, maybe you want to classify the text of a general internal medicine consult into its subspecialty based on the words used, or perhaps you want to determine which IT tickets are likely high priority. Maybe you want to convert free-text diagnoses into categorical diagnoses. Ultimately, the problem I want to tackle is text classification.

In any case, the book that I have been reading at home is Deep Learning with R by François Chollet and J.J. Allaire, which has many interesting and easy-to-follow examples. Since it’s on my mind, I thought a deep learning model would be a good place to start. But I did not want to just redo one of the examples from the book, because those data sets are already cleansed and, in that sense, much of the heavy lifting is done. I wanted to start from a new data set and use the approach shown in section 3.5 but apply it to a new text classification problem. Because I want to follow the basic flow of the Reuters newswire classifier, I need a similar natural language processing (NLP) multiclass text classification problem.

The problem I have chosen is one of authorship classification. Specifically, given any Greek sentence taken from the New Testament, can I make a deep learning classifier that will identify the author of a verse that the classifier has never seen?

Data Cleansing

The text of the New Testament is available online from numerous sources. I downloaded it here and chose the Byzantine Textform 2005 file. The text has already been cleansed: it is in lower case and transliterated to English characters. Several steps are needed to get it into a simple dataframe, which the following code achieves; the result is a dataframe where each row is a verse.
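Something like the following will do it, assuming the unzipped .ASC files sit in a local byzantine_2005/ directory (the name is arbitrary) and that each line begins with a chapter:verse reference followed by the verse text. The book-to-author lookup is my own, following the traditional attributions: Luke also gets Acts, John gets his epistles and Revelation, and Hebrews is "unknown".

library(tidyverse)

# Read every book file into one dataframe: one row per verse
files <- list.files("byzantine_2005", pattern = "\\.ASC$", full.names = TRUE)
nt <- map_dfr(files, function(f) {
  lines <- read_lines(f)
  lines <- lines[lines != ""]                        # drop any blank lines
  tibble(
    reference = str_extract(lines, "^\\S+"),          # e.g. "1:1"
    verse     = str_trim(str_remove(lines, "^\\S+")), # the verse text
    book      = basename(f)
  )
})

# Traditional author attribution for each book file
author_lookup <- c(
  MT05.ASC = "matthew", MR05.ASC = "mark", LU05.ASC = "luke",
  JOH05.ASC = "john", AC05.ASC = "luke", RO05.ASC = "paul",
  `1CO05.ASC` = "paul", `2CO05.ASC` = "paul", GA05.ASC = "paul",
  EPH05.ASC = "paul", PHP05.ASC = "paul", COL05.ASC = "paul",
  `1TH05.ASC` = "paul", `2TH05.ASC` = "paul", `1TI05.ASC` = "paul",
  `2TI05.ASC` = "paul", TIT05.ASC = "paul", PHM05.ASC = "paul",
  HEB05.ASC = "unknown", JAS05.ASC = "james", `1PE05.ASC` = "peter",
  `2PE05.ASC` = "peter", `1JO05.ASC` = "john", `2JO05.ASC` = "john",
  `3JO05.ASC` = "john", JUDE05.ASC = "jude", RE05.ASC = "john"
)
nt <- nt %>% mutate(author = unname(author_lookup[book]))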

Now that this wrangling is complete, we have a tibble that looks like this:

reference verse book author
1:1 biblov genesewv ihsou cristou uiou dauid uiou abraam MT05.ASC matthew
1:2 abraam egennhsen ton isaak isaak de egennhsen ton iakwb iakwb de egennhsen ton ioudan kai touv adelfouv autou MT05.ASC matthew
1:3 ioudav de egennhsen ton farev kai ton zara ek thv yamar farev de egennhsen ton esrwm esrwm de egennhsen ton aram MT05.ASC matthew
1:4 aram de egennhsen ton aminadab aminadab de egennhsen ton naasswn naasswn de egennhsen ton salmwn MT05.ASC matthew
1:5 salmwn de egennhsen ton booz ek thv racab booz de egennhsen ton wbhd ek thv rouy wbhd de egennhsen ton iessai MT05.ASC matthew
1:6 iessai de egennhsen ton dauid ton basilea dauid de o basileuv egennhsen ton solomwna ek thv tou ouriou MT05.ASC matthew
1:7 solomwn de egennhsen ton roboam roboam de egennhsen ton abia abia de egennhsen ton asa MT05.ASC matthew
1:8 asa de egennhsen ton iwsafat iwsafat de egennhsen ton iwram iwram de egennhsen ton ozian MT05.ASC matthew
1:9 oziav de egennhsen ton iwayam iwayam de egennhsen ton acaz acaz de egennhsen ton ezekian MT05.ASC matthew
1:10 ezekiav de egennhsen ton manassh manasshv de egennhsen ton amwn amwn de egennhsen ton iwsian MT05.ASC matthew

We should get verse counts that match what is expected, which we do.

book counts
MT05.ASC 1070
MR05.ASC 677
LU05.ASC 1149
JOH05.ASC 878
AC05.ASC 1003
RO05.ASC 432
1CO05.ASC 436
2CO05.ASC 256
GA05.ASC 148
EPH05.ASC 154
PHP05.ASC 103
COL05.ASC 94
1TH05.ASC 88
2TH05.ASC 46
1TI05.ASC 112
2TI05.ASC 82
TIT05.ASC 45
PHM05.ASC 24
HEB05.ASC 302
JAS05.ASC 107
1PE05.ASC 104
2PE05.ASC 60
1JO05.ASC 104
2JO05.ASC 12
3JO05.ASC 13
JUDE05.ASC 24
RE05.ASC 403

And we can check the unique word count:
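A base-R one-liner suffices, assuming the words are space-delimited as in the tibble above; this should land on the 17156 figure quoted below.

# Count the unique space-delimited tokens across all verses
length(unique(unlist(strsplit(nt$verse, " ", fixed = TRUE))))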

Normally at this point, we might remove stop words and then stem and lemmatize the text (i.e., get rid of useless little words and of suffixes that cause words of the same meaning to look different). This would be more important in more traditional learning classifiers but is likely less important when using Keras and TensorFlow. If I were running this classifier on the English text of the KJV, for example, I would run it with and without such a process and gauge the performance change. There are numerous NLP packages specifically dedicated to this task. I am going to skip it here. This process is, of course, highly language-dependent.

The other thing I need to do is make the author-factor column numbered 0-8 instead of 1-9, because R is going to be calling Python code and Python starts counting at 0. This bug took me a while to sort out.
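The fix is a one-liner: convert the author column to a factor and subtract 1 from its integer codes. With alphabetical factor levels this gives james = 0 through unknown = 8, which matches the sample rows shown further down.

nt <- nt %>%
  mutate(author_factor = as.integer(factor(author)) - 1L)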

Now we will make a tokenizer, that is, a function to convert words to integers, and we will limit the model to the top 15000 of the 17156 unique words found in the text.
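With the keras package this looks something like the following; num_words = 15000 keeps only the most frequent words.

library(keras)

max_words <- 15000
tokenizer <- text_tokenizer(num_words = max_words) %>%
  fit_text_tokenizer(nt$verse)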

Now we need to split the text randomly into training and testing sets in an 80:20 split.
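A simple random split along these lines works, though see the stratified version below.

set.seed(2019)
train_idx <- sample(nrow(nt), size = floor(0.8 * nrow(nt)))
train <- nt[train_idx, ]
test  <- nt[-train_idx, ]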

The data is very imbalanced, that is, there are authors (like Jude and James) that have very few verses ascribed to them and there are others (like Paul and Luke) who have many verses. For this reason, we should sanity check our training and testing data to make sure that we have sampled about 80% of each book. There are specific tools to achieve this process which is referred to as stratified sampling.
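As a quick sanity check, something like this shows the fraction of each book that landed in the training set; ideally every row is close to 0.80.

nt %>%
  mutate(in_train = row_number() %in% train_idx) %>%
  group_by(book) %>%
  summarise(n = n(), prop_train = mean(in_train))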

We can see that we have a problem with author 2 (Jude), who has only 24 verses. This is probably not going to matter much, but we can try balanced sampling, with which we do get better proportions.
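One simple way to get balanced (stratified) sampling is to sample 80% within each book; this is a sketch using plain dplyr rather than a dedicated package.

set.seed(2019)
train <- nt %>% group_by(book) %>% sample_frac(0.8) %>% ungroup()
test  <- anti_join(nt, train, by = c("book", "reference"))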

Now we can tokenize the data, that is, convert the verses from lists of integers to a one-hot encoded form.
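The keras tokenizer can do this directly with texts_to_matrix() in binary mode, and to_categorical() handles the labels; this sketch assumes the train/test split above.

x_train <- texts_to_matrix(tokenizer, train$verse, mode = "binary")
x_test  <- texts_to_matrix(tokenizer, test$verse,  mode = "binary")
y_train <- to_categorical(train$author_factor, num_classes = 9)
y_test  <- to_categorical(test$author_factor,  num_classes = 9)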

Satisfy ourselves that the training data is in random order:

reference verse book author author_factor verse_number
16:6 all oti tauta lelalhka umin h luph peplhrwken umwn thn kardian JOH05.ASC john 1 3584
2:20 all ecw kata sou oti afeiv thn gunaika sou iezabel h legei eauthn profhtin kai didaskei kai plana touv emouv doulouv porneusai kai fagein eidwloyuta RE05.ASC john 1 7563
21:23 kai elyonti autw eiv to ieron proshlyon autw didaskonti oi arciereiv kai oi presbuteroi tou laou legontev en poia exousia tauta poieiv kai tiv soi edwken thn exousian tauthn MT05.ASC matthew 5 705
5:1 dikaiwyentev oun ek pistewv eirhnhn ecomen prov ton yeon dia tou kuriou hmwn ihsou cristou RO05.ASC paul 6 4895
12:29 h pwv dunatai tiv eiselyein eiv thn oikian tou iscurou kai ta skeuh autou diarpasai ean mh prwton dhsh ton iscuron kai tote thn oikian autou diarpasei MT05.ASC matthew 5 374
4:24 alla kai di hmav oiv mellei logizesyai toiv pisteuousin epi ton egeiranta ihsoun ton kurion hmwn ek nekrwn RO05.ASC paul 6 4893
27:31 eipen o paulov tw ekatontarch kai toiv stratiwtaiv ean mh outoi meinwsin en tw ploiw umeiv swyhnai ou dunasye AC05.ASC luke 3 4734
1:25 kai hrwthsan auton kai eipon autw ti oun baptizeiv ei su ouk ei o cristov oute hliav oute o profhthv JOH05.ASC john 1 2921
3:6 kai exelyontev oi farisaioi euyewv meta twn hrwdianwn sumboulion epoioun kat autou opwv auton apoleswsin MR05.ASC mark 4 1149
8:4 ei men gar hn epi ghv oud an hn iereuv ontwn twn ierewn twn prosferontwn kata ton nomon ta dwra HEB05.ASC unknown 8 6930

Now we can build a basic model:
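The architecture below follows the Reuters example from section 3.5; the layer sizes are my choice, and the whole thing is wrapped in a function for reuse in the k-fold cross-validation later.

build_model <- function() {
  keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = c(max_words)) %>%
    layer_dense(units = 64, activation = "relu") %>%
    layer_dense(units = 9, activation = "softmax") %>%   # one unit per author
    compile(
      optimizer = "rmsprop",
      loss = "categorical_crossentropy",
      metrics = "accuracy"
    )
}
model <- build_model()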

and pull out validation data, again in an 80:20 split.
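Again a sketch, carving 20% of the training matrix off for validation:

set.seed(2019)
val_idx <- sample(nrow(x_train), size = floor(0.2 * nrow(x_train)))
x_val <- x_train[val_idx, ];  partial_x_train <- x_train[-val_idx, ]
y_val <- y_train[val_idx, ];  partial_y_train <- y_train[-val_idx, ]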

Now we run the model:
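The fit call looks like this; the epoch count and batch size are my guesses in the spirit of the book's example.

history <- model %>% fit(
  partial_x_train, partial_y_train,
  epochs = 20, batch_size = 512,
  validation_data = list(x_val, y_val)
)
plot(history)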

[Figure: training history showing loss and accuracy by epoch]

We can show the model performance graphically:

[Figure: confusion matrix of predicted versus actual authors]

Results are not great because many authors are being misclassified as Paul or Luke. This likely stems from author imbalance, so we can address it with class weights and dropout layers, as suggested in this very informative tutorial from Dr. Bharatendra Rai.
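A sketch of that approach: weight each class inversely to its verse count so that rare authors matter more, and add dropout between the dense layers. The weighting scheme and the dropout rate of 0.5 are my choices.

# Inverse-frequency class weights; names are the 0-8 author codes
class_counts  <- table(train$author_factor)
class_weights <- as.list(round(max(class_counts) / class_counts, 2))

model <- keras_model_sequential() %>%
  layer_dense(units = 64, activation = "relu", input_shape = c(max_words)) %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 9, activation = "softmax") %>%
  compile(optimizer = "rmsprop",
          loss = "categorical_crossentropy",
          metrics = "accuracy")

history <- model %>% fit(
  partial_x_train, partial_y_train,
  epochs = 20, batch_size = 512,
  validation_data = list(x_val, y_val),
  class_weight = class_weights
)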

What we get looks a little better with more counts on the diagonal.

[Figure: confusion matrix after class weighting and dropout]

The model is jumpy on the small books, probably because they are undersampled. This means that k-fold cross-validation will help us assess model performance. I am not sure whether I should try to have balanced sampling in the folds, but I am not going to worry about that at the moment.

Run the k-fold cross-validation.
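A minimal version of the loop, reusing build_model() from above; the epoch count is the knob discussed below, and note that the accuracy metric may be named "acc" rather than "accuracy" in older versions of keras.

k <- 5
set.seed(2019)
folds <- sample(rep(1:k, length.out = nrow(x_train)))
val_acc <- numeric(k)

for (i in seq_len(k)) {
  cat("processing fold", i, "\n")
  val   <- folds == i
  model <- build_model()
  model %>% fit(x_train[!val, ], y_train[!val, ],
                epochs = 5, batch_size = 512, verbose = 0)
  score <- model %>% evaluate(x_train[val, ], y_train[val, ], verbose = 0)
  val_acc[i] <- score[["accuracy"]]
}
mean(val_acc)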

Validation accuracy improves modestly with more epochs, but the model definitely overfits the training data (getting to the high 90s in accuracy). This is a bit of a conundrum for which I do not know the answer (those who know, please comment): namely, I can overfit the model to make gains on the validation set, and these do improve performance on the test set, but I expect that this improvement is happening in some non-generalizable way.

[Figure: validation accuracy by epoch across the k folds]

Likewise, loss slowly declines over many epochs, but the model overfits.

[Figure: validation loss by epoch across the k folds]

In any case, this is the model performance rerunning the k-fold cross validation with 5 epochs.

Final Outcome

Satisfied enough that 5 epochs should be OK, I can run the model on the whole training set and look at its performance on the testing set.
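That final run looks something like this, reusing the class weights from above; the cross-tabulation at the end is what feeds the confusion matrix below.

model <- build_model()
model %>% fit(x_train, y_train, epochs = 5, batch_size = 512,
              class_weight = class_weights, verbose = 0)
model %>% evaluate(x_test, y_test)

# Confusion matrix: predicted author code versus actual
pred <- (model %>% predict(x_test) %>% apply(1, which.max)) - 1
table(predicted = pred, actual = test$author_factor)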

[Figure: confusion matrix for the final model on the test set]

Some interesting findings:

  • John seems to be the easiest to classify. This fits well with his unique authorship style.
  • The synoptic gospels are easily misclassified among one another. Again, this fits with the overlap of stories, parables and other content.
  • Hebrews looks more like Hebrews than it looks like Paul. This fits with the perspective that Paul is not the author of Hebrews.
  • Poor James, Jude and Peter: there are just not enough verses to get proper classification. I am sure there are ways to address this kind of imbalance if classifying Jude correctly were a very important thing to do.

I think I am going to stop trying to improve this because it is not a real-world problem, but I hope that someone else can recycle some of this code for a real-life one. I would be interested in comments on how to get improved classification of small classes.

Parting Easter Thought

ouk estin wde hgeryh gar kaywv eipen deute idete ton topon opou ekeito o kuriov, Matthew 28:6

(“He is not here: for he is risen, as he said. Come, see the place where the Lord lay.”)