Text Classification with SciBERT

Yash Gupta
5 min read · Mar 30, 2021


The BERT model has been on the rise lately in the field of NLP and text classification. The model has a transformer architecture with 110 million parameters, pre-trained by Google on the masked language modeling and next sentence prediction tasks. We use the Hugging Face transformers library and PyTorch to train our system.

Figure 0 — BERT

This blog describes the process of fine-tuning a SciBERT model for the task of text classification. Figure 1 gives a high-level picture of transfer learning. If you’re not familiar with the concept, I recommend reading this blog: https://towardsdatascience.com/a-comprehensive-hands-on-guide-to-transfer-learning-with-real-world-applications-in-deep-learning-212bf3b2f27a

Figure 1 — Transfer Learning

In our use case the pre-trained model is SciBERT, a BERT model pre-trained specifically on scientific text. It is fine-tuned with a single classification layer on top. Let’s start!

The Dataset and Analysis

The dataset contains titles of published papers and, for each title, a label indicating the journal it was published in. We have a total of 21643 such records in the training set and 3373 in the test set. There are 5 possible labels or journal names for a given title: {‘INFOCOM’, ‘ISCAS’, ‘SIGGRAPH’, ‘VLDB’, ‘WWW’}. The figures below visualize the number of records available for training and testing our system for each journal.

Figure 2 — Label Distribution in Train Set
Figure 3 — Label distribution in Test Set

In the training set:

- INFOCOM: 4481 records
- ISCAS: 7514 records
- SIGGRAPH: 2678 records
- VLDB: 3678 records
- WWW: 3292 records

A word cloud generated from a string consisting of all the titles results in the following image.

Figure 4 — Word Cloud of all Titles

Visualizing with a word cloud gives an indication of the frequency of each word in the training corpus.
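If you want to reproduce the word cloud, here is a minimal sketch using the wordcloud package (train_df and its ‘title’ column are assumptions about how the data is loaded):

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Join all titles into one string and render a word cloud from it.
text = " ".join(train_df["title"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```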

Preparation for training and evaluation

We use some basic boilerplate code for a general BERT model (take a look at the notebook). A snippet of such code is below, where we define one input example for the BERT model.

Figure 5 — One Input Example
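For readers who want runnable code, a minimal sketch of such an input-example container (field names follow the common BERT fine-tuning boilerplate and may differ slightly from the notebook):

```python
class InputExample:
    """A single training/test example for sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        self.guid = guid      # unique id for the example
        self.text_a = text_a  # the paper title
        self.text_b = text_b  # unused for single-sentence classification
        self.label = label    # journal name, e.g. 'VLDB'
```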

Each example must also be converted into the features the BERT model consumes. The code below defines the feature set for a given record in the training/test set.

Figure 6 — Feature of one input example
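A sketch of the corresponding feature container (again, the field names follow the usual boilerplate and are assumptions):

```python
class InputFeatures:
    """Numeric features for one example, ready for the model."""

    def __init__(self, input_ids, attention_mask, token_type_ids, label_id):
        self.input_ids = input_ids            # token ids from the tokenizer
        self.attention_mask = attention_mask  # 1 for real tokens, 0 for padding
        self.token_type_ids = token_type_ids  # segment ids (all 0 for single sentences)
        self.label_id = label_id              # integer-encoded journal label
```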

The next step is to generate examples from the training and test data. The code below iterates through the given sets to generate examples for training and evaluation.

Figure 7 — Generate examples
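A hedged sketch of that iteration, assuming the data lives in pandas DataFrames train_df and test_df with ‘title’ and ‘label’ columns (both names are assumptions):

```python
def create_examples(df, set_type):
    # Wrap each (title, label) row in an InputExample.
    examples = []
    for i, row in df.iterrows():
        examples.append(InputExample(guid=f"{set_type}-{i}",
                                     text_a=row["title"],
                                     label=row["label"]))
    return examples

train_examples = create_examples(train_df, "train")
test_examples = create_examples(test_df, "test")
```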

We then convert the examples to features using a pretrained SciBERT tokenizer. Take a look at the notebook for generating features from examples. The snippets below show how to import the tokenizer and the model. A snippet is also attached to show how the tokenizer works on any textual data.

Figure 8 — Importing tokenizer
Figure 9 — Importing pretrained model
Figure 10 — Tokenizer in working
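If you want to reproduce the imports without the notebook, they could look roughly like this (the SciBERT checkpoint on the Hugging Face hub is allenai/scibert_scivocab_uncased; the notebook may use the Bert* classes instead of the Auto* ones):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased",
    num_labels=5,  # five journals
)

# A quick look at how the tokenizer splits a title into WordPiece tokens.
print(tokenizer.tokenize("Text Classification with SciBERT"))
```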

We use the Tesla K80 GPU, available for free with Google Colab, for training and evaluation. We also exclude some parameters of the pretrained model from weight decay during training, set the learning rate to 0.0003 and add a warmup of 0.0001. The snippets below show how some examples in the training, dev and test sets look, along with the input features for one example: its input_ids, attention_ids, segment_ids and label_ids.

Figure 11 — Training, Val/Dev and Test Examples
Figure 12 — Features for one input example
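As a sketch of the no-decay/learning-rate setup mentioned above (the no-decay parameter names and the 0.01 weight-decay value are common-convention assumptions; only the learning rate is quoted from the text, and the warmup would be handled by a scheduler in the notebook):

```python
from torch.optim import AdamW

# Bias and LayerNorm weights are conventionally excluded from weight decay.
no_decay = ["bias", "LayerNorm.weight"]
grouped_parameters = [
    {"params": [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)],
     "weight_decay": 0.01},
    {"params": [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)],
     "weight_decay": 0.0},
]
optimizer = AdamW(grouped_parameters, lr=0.0003)
```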

We use a random sampler for training and a sequential sampler for evaluating our system, and fine-tune for 4 epochs.
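In PyTorch terms, that sampler setup could look like this (train_features/test_features come from the conversion step above; the batch size is an assumption):

```python
import torch
from torch.utils.data import (DataLoader, RandomSampler,
                              SequentialSampler, TensorDataset)

def to_dataset(features):
    # Stack the InputFeatures fields into tensors.
    return TensorDataset(
        torch.tensor([f.input_ids for f in features], dtype=torch.long),
        torch.tensor([f.attention_mask for f in features], dtype=torch.long),
        torch.tensor([f.token_type_ids for f in features], dtype=torch.long),
        torch.tensor([f.label_id for f in features], dtype=torch.long),
    )

train_dataset = to_dataset(train_features)
eval_dataset = to_dataset(test_features)

train_loader = DataLoader(train_dataset,
                          sampler=RandomSampler(train_dataset), batch_size=32)
eval_loader = DataLoader(eval_dataset,
                         sampler=SequentialSampler(eval_dataset), batch_size=32)
```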

Training and Evaluation

We write the training loop to fine-tune our model for 4 epochs. All computation with the BERT model takes place on the GPU. Gradients are accumulated for 10 steps before each optimizer update, which could have led to exploding gradients, but did not. The model is checked against the validation set after every optimizer step. A hedged sketch of such a loop is below; the actual output from the training loop follows it.
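(Batch unpacking order and variable names here are assumptions; see the notebook for the real loop.)

```python
import torch

device = torch.device("cuda")  # Tesla K80 on Colab
model.to(device)
accumulation_steps = 10

for epoch in range(4):
    model.train()
    for step, batch in enumerate(train_loader):
        input_ids, attention_mask, token_type_ids, labels = (
            t.to(device) for t in batch)
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        labels=labels)
        # Scale the loss so gradients accumulated over 10 steps average out.
        loss = outputs.loss / accumulation_steps
        loss.backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            # validation pass after each optimizer step (omitted here)
```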

Figure 13 — Output from the training loop

The model trains reasonably well, as seen from the training loss decreasing after every iteration. However, performance on the validation set drops after the third round of optimization, hinting at overfitting. The classification results on the test set can be seen in the snippet below.

Figure 14 — Test Results
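For reference, a per-class report like the one above could be produced with scikit-learn (an assumption; the order of target_names must match the integer encoding of the labels):

```python
import torch
from sklearn.metrics import classification_report

model.eval()
preds, gold = [], []
with torch.no_grad():
    for batch in eval_loader:
        input_ids, attention_mask, token_type_ids, labels = (
            t.to(device) for t in batch)
        logits = model(input_ids=input_ids,
                       attention_mask=attention_mask,
                       token_type_ids=token_type_ids).logits
        preds.extend(logits.argmax(dim=-1).tolist())
        gold.extend(labels.tolist())

print(classification_report(
    gold, preds,
    target_names=["INFOCOM", "ISCAS", "SIGGRAPH", "VLDB", "WWW"]))
```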

In the end

We managed to train a SciBERT system with good F1 scores on a multi-class classification task. The model may be overfitting somewhat and can definitely be improved in both design and training. Still, it can serve as a baseline for future investigations.
