Quora Question Pairs using BERT : Overview

Task: identify whether two questions have the same context/meaning or not (Kaggle: Quora Question Pairs).
I have tried this problem using two different approaches:

  1. Using Naive Bayes Classifier
  2. Using BERT

Naive Bayes Classifier

import pandas as pd
import numpy as np
import os
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
path = "/content/drive/MyDrive/Quora Questions NLP"
nltk.download('stopwords')
nltk.download('wordnet')
train = pd.read_csv(path+"/train.csv")
print("Total samples:",len(train))
train.head(10)
Total samples: 404290
id qid1 qid2 question1 question2 is_duplicate
0 0 1 2 What is the step by step guide to invest in sh... What is the step by step guide to invest in sh... 0
1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... What would happen if the Indian government sto... 0
2 2 5 6 How can I increase the speed of my internet co... How can Internet speed be increased by hacking... 0
3 3 7 8 Why am I mentally very lonely? How can I solve... Find the remainder when [math]23^{24}[/math] i... 0
4 4 9 10 Which one dissolve in water quikly sugar, salt... Which fish would survive in salt water? 0
5 5 11 12 Astrology: I am a Capricorn Sun Cap moon and c... I'm a triple Capricorn (Sun, Moon and ascendan... 1
6 6 13 14 Should I buy tiago? What keeps childern active and far from phone ... 0
7 7 15 16 How can I be a good geologist? What should I do to be a great geologist? 1
8 8 17 18 When do you use シ instead of し? When do you use "&" instead of "and"? 0
9 9 19 20 Motorola (company): Can I hack my Charter Moto... How do I hack Motorola DCX3400 for free internet? 0
print(train.isnull().sum(axis=0))  # Check for null values
train.dropna(axis=0, inplace=True)  # Drop rows with nulls
id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64

Text Preprocessing

  • Remove stop words
  • Lemmatize
def preprocess(series):
  # Remove characters other than letters & digits, lowercase, and split into words
  words = re.sub("[^A-Za-z0-9]", " ", series).lower().split()

  # Lemmatize words and drop English stop words
  lemm = WordNetLemmatizer()
  stpwords = set(stopwords.words('english'))
  lemmatized = [lemm.lemmatize(word) for word in words if word not in stpwords]
  sent = ' '.join(lemmatized)
  return sent
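A quick sanity check of what preprocess returns (a sketch, assuming the NLTK data downloaded above; the exact lemmas depend on WordNet):

print(preprocess("What is the step by step guide to invest in share market?"))
# e.g. "step step guide invest share market"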
train['question1'] = train['question1'].apply(preprocess)  # Apply preprocessing
train['question2'] = train['question2'].apply(preprocess)

# Concatenate question1 & question2 into a single text field
train['combine'] = train.apply(lambda ser: ser['question1'] + " " + ser['question2'], axis=1)
train.head(10)
id qid1 qid2 question1 question2 is_duplicate combine
0 0 1 2 What is the step by step guide to invest in sh... What is the step by step guide to invest in sh... 0 What is the step by step guide to invest in sh...
1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... What would happen if the Indian government sto... 0 What is the story of Kohinoor (Koh-i-Noor) Dia...
2 2 5 6 How can I increase the speed of my internet co... How can Internet speed be increased by hacking... 0 How can I increase the speed of my internet co...
3 3 7 8 Why am I mentally very lonely? How can I solve... Find the remainder when [math]23^{24}[/math] i... 0 Why am I mentally very lonely? How can I solve...
4 4 9 10 Which one dissolve in water quikly sugar, salt... Which fish would survive in salt water? 0 Which one dissolve in water quikly sugar, salt...
5 5 11 12 Astrology: I am a Capricorn Sun Cap moon and c... I'm a triple Capricorn (Sun, Moon and ascendan... 1 Astrology: I am a Capricorn Sun Cap moon and c...
6 6 13 14 Should I buy tiago? What keeps childern active and far from phone ... 0 Should I buy tiago? What keeps childern active...
7 7 15 16 How can I be a good geologist? What should I do to be a great geologist? 1 How can I be a good geologist? What should I d...
8 8 17 18 When do you use シ instead of し? When do you use "&" instead of "and"? 0 When do you use シ instead of し? When do you us...
9 9 19 20 Motorola (company): Can I hack my Charter Moto... How do I hack Motorola DCX3400 for free internet? 0 Motorola (company): Can I hack my Charter Moto...

Convert Words into Vectors

tfidf = TfidfVectorizer(max_features=50000)  # Words to vectors using TF-IDF

# Take the combined questions as X
X = tfidf.fit_transform(train['combine'])
y = np.array(train['is_duplicate'])
print(X.shape)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05)
print(X_train.shape,X_test.shape)
(404287, 50000)
(384072, 50000) (20215, 50000)
naive_model = MultinomialNB()#Training
naive_model.fit(X_train,y_train)

#Predictions
y_pred_train = naive_model.predict(X_train)
y_pred_test = naive_model.predict(X_test)
 
accuracy_train = (y_pred_train == y_train).mean()
accuracy_test = (y_pred_test == y_test).mean()
print(accuracy_train,accuracy_test)
0.7558140140390344 0.7431610190452634

We got ~74% accuracy, which is poor for this binary classification problem: always predicting the majority class ("not duplicate") already scores about 63%.
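A minimal sketch to make that comparison concrete, computing the majority-class baseline on the same test split (reuses y_test from above):

majority_baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"Always-predict-majority accuracy: {majority_baseline:.3f}")  # roughly 0.63 on this data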

BERT

I used the "Semantic Similarity with BERT" Keras example as the basis for this approach.
Reference: https://keras.io/examples/nlp/semantic_similarity_with_bert/

import numpy as np
import pandas as pd
import tensorflow as tf
!pip install transformers==2.11.0
import transformers
max_length = 128  # Maximum length of input sentence to the model.
batch_size = 32
epochs = 2

# Value returned by check_similarity (defined later) for each softmax index:
# index 0 (is_duplicate = 0) -> 1 : Non-duplicate
# index 1 (is_duplicate = 1) -> 0 : Duplicate
labels = [1, 0]
df = pd.read_csv(path+"/train.csv")

Preprocessing

print(df.isnull().sum(axis=0))  # Check for null values
df.dropna(axis=0, inplace=True)  # Drop rows with nulls
id              0
qid1            0
qid2            0
question1       1
question2       2
is_duplicate    0
dtype: int64
# ~70% of rows go to training; the rest is split evenly into validation and test
mask = np.random.rand(len(df)) < 0.7
train_df = df[mask]
not_train = df[~mask]

#create mask for val-test distribution
mask = np.random.rand(len(not_train)) < 0.5
test_df = not_train[mask]
val_df = not_train[~mask]
val_df.head(5)
id qid1 qid2 question1 question2 is_duplicate
1 1 3 4 What is the story of Kohinoor (Koh-i-Noor) Dia... What would happen if the Indian government sto... 0
10 10 21 22 Method to find separation of slits using fresn... What are some of the things technicians can te... 0
39 39 79 80 What is the stall speed and AOA of an f-14 wit... Why did aircraft stop using variable-sweep win... 0
63 63 127 128 Why do I always get depressed? Why do I always get depressed in the evening? 0
64 64 129 130 Where can I find a European family office data... Where do I find a U.S. family office database? 0
print(f"Total train samples : {train_df.shape[0]}")
print(f"Total validation samples: {val_df.shape[0]}")
print(f"Total test samples: {test_df.shape[0]}")
Total train samples : 282550
Total validation samples: 60634
Total test samples: 61103
print("Train Target Distribution")
print(train_df.is_duplicate.value_counts())

print("Validation Target Distribution")
print(val_df.is_duplicate.value_counts())
Train Target Distribution
0    178627
1    103923
Name: is_duplicate, dtype: int64
Validation Target Distribution
0    38030
1    22604
Name: is_duplicate, dtype: int64
y_train = tf.keras.utils.to_categorical(train_df.is_duplicate, num_classes=2)  # One-hot encoded targets
print(f"y_train.shape:{y_train.shape}")

y_val = tf.keras.utils.to_categorical(val_df.is_duplicate, num_classes=2)
print(f"y_val.shape:{y_val.shape}")

y_test = tf.keras.utils.to_categorical(test_df.is_duplicate, num_classes=2)
print(f"y_test.shape:{y_test.shape}")
y_train.shape:(282550, 2)
y_val.shape:(60634, 2)
y_test.shape:(61103, 2)

Custom Data Generator

class BertSemanticDataGenerator(tf.keras.utils.Sequence):
    """Generates batches of data.

    Args:
        sentence_pairs: Array of premise and hypothesis input sentences.
        labels: Array of labels.
        batch_size: Integer batch size.
        shuffle: boolean, whether to shuffle the data.
        include_targets: boolean, whether to include the labels.

    Returns:
        Tuples `([input_ids, attention_mask, token_type_ids], labels)`
        (or just `[input_ids, attention_mask, token_type_ids]`
         if `include_targets=False`)
    """

    def __init__(
        self,
        sentence_pairs,
        labels,
        batch_size=batch_size,
        shuffle=True,
        include_targets=True,
    ):
        self.sentence_pairs = sentence_pairs
        self.labels = labels
        self.shuffle = shuffle
        self.batch_size = batch_size
        self.include_targets = include_targets
        
        # Load our BERT Tokenizer to encode the text.
        # We will use the bert-base-uncased pretrained model.
        self.tokenizer = transformers.BertTokenizer.from_pretrained(
            "bert-base-uncased", do_lower_case=True
        )
        self.indexes = np.arange(len(self.sentence_pairs))
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch.
        return len(self.sentence_pairs) // self.batch_size

    def __getitem__(self, idx):
        # Retrieves the batch at the given index.
        indexes = self.indexes[idx * self.batch_size : (idx + 1) * self.batch_size]
        sentence_pairs = self.sentence_pairs[indexes]

        # With the BERT tokenizer's batch_encode_plus, both sentences of a pair are
        # encoded together and separated by the [SEP] token.
        encoded = self.tokenizer.batch_encode_plus(
            sentence_pairs.tolist(),
            add_special_tokens=True,
            max_length=max_length,
            return_attention_mask=True,
            return_token_type_ids=True,
            pad_to_max_length=True,
            return_tensors="tf",
        )

        # Convert batch of encoded features to numpy array.
        input_ids = np.array(encoded["input_ids"], dtype="int32")
        attention_masks = np.array(encoded["attention_mask"], dtype="int32")
        token_type_ids = np.array(encoded["token_type_ids"], dtype="int32")

        # Set to true if data generator is used for training/validation.
        if self.include_targets:
            labels = np.array(self.labels[indexes], dtype="int32")
            return [input_ids, attention_masks, token_type_ids], labels
        else:
            return [input_ids, attention_masks, token_type_ids]

    def on_epoch_end(self):
        # Shuffle indexes after each epoch if shuffle is set to True.
        if self.shuffle:
            np.random.RandomState(42).shuffle(self.indexes)
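A quick shape check for one generated batch (a sketch; it reuses train_df, y_train and the global batch_size from above, and downloads the BERT tokenizer on first use):

sample_gen = BertSemanticDataGenerator(
    train_df[["question1", "question2"]].values.astype("str")[:64],
    y_train[:64],
    shuffle=False,
)
(ids, mask, types), yb = sample_gen[0]
print(ids.shape, mask.shape, types.shape, yb.shape)  # expect (32, 128) x 3 and (32, 2)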

Build The Model

strategy = tf.distribute.MirroredStrategy()# Create the model under a distribution strategy scope.

with strategy.scope():
    # Encoded token ids from BERT tokenizer.
    input_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="input_ids"
    )
    # Attention masks indicate which tokens the model should attend to.
    attention_masks = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="attention_masks"
    )
    # Token type ids are binary masks identifying different sequences in the model.
    token_type_ids = tf.keras.layers.Input(
        shape=(max_length,), dtype=tf.int32, name="token_type_ids"
    )
    # Loading pretrained BERT model.
    bert_model = transformers.TFBertModel.from_pretrained("bert-base-uncased")
    # Freeze the BERT model to reuse the pretrained features without modifying them.
    bert_model.trainable = False

    sequence_output, pooled_output = bert_model(
        input_ids, attention_mask=attention_masks, token_type_ids=token_type_ids
    )
    # Add trainable layers on top of the frozen layers to adapt the pretrained features to the new data.
    bi_lstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(sequence_output)
    
    # Applying hybrid pooling approach to bi_lstm sequence output.
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(2, activation="softmax")(dropout)
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_masks, token_type_ids], outputs=output
    )

    model.compile(
        optimizer=tf.keras.optimizers.Adam(),
        loss="categorical_crossentropy",
        metrics=["acc"],
    )


print(f"Strategy: {strategy}")
model.summary()
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Strategy: <tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7fa42b661090>
Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model_1 (TFBertModel)   ((None, 128, 768), ( 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0]            
                                                                 token_type_ids[0][0]             
__________________________________________________________________________________________________
bidirectional_1 (Bidirectional) (None, 128, 128)     426496      tf_bert_model_1[0][0]            
__________________________________________________________________________________________________
global_average_pooling1d_1 (Glo (None, 128)          0           bidirectional_1[0][0]            
__________________________________________________________________________________________________
global_max_pooling1d_1 (GlobalM (None, 128)          0           bidirectional_1[0][0]            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 256)          0           global_average_pooling1d_1[0][0] 
                                                                 global_max_pooling1d_1[0][0]     
__________________________________________________________________________________________________
dropout_75 (Dropout)            (None, 256)          0           concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 2)            514         dropout_75[0][0]                 
==================================================================================================
Total params: 109,909,250
Trainable params: 427,010
Non-trainable params: 109,482,240
__________________________________________________________________________________________________

Create train and validation data generators

train_data = BertSemanticDataGenerator(
    train_df[["question1", "question2"]].values.astype("str"),
    y_train,
    batch_size=batch_size,
    shuffle=True,
)
val_data = BertSemanticDataGenerator(
    val_df[["question1", "question2"]].values.astype("str"),
    y_val,
    batch_size=batch_size,
    shuffle=False,
)

Train the model

Training is done only for the top layers ("feature extraction"), so the model learns to use the representations of the frozen pretrained BERT without modifying them.
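As a quick check that only the new layers are trainable at this stage, one can count the trainable parameters (a minimal sketch; the number should match the model summary above):

n_trainable = sum(tf.keras.backend.count_params(w) for w in model.trainable_weights)
print(f"Trainable parameters while BERT is frozen: {n_trainable:,}")  # expect 427,010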

history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)
Epoch 1/2
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
8850/8850 [==============================] - 3281s 366ms/step - loss: 0.4316 - acc: 0.7859 - val_loss: 0.3371 - val_acc: 0.8432
Epoch 2/2
8850/8850 [==============================] - 3229s 365ms/step - loss: 0.3474 - acc: 0.8382 - val_loss: 0.3269 - val_acc: 0.8474

Fine Tuning

The BERT model now has general knowledge of language and context, so we can unfreeze its pretrained weights and retrain the whole network with a very low learning rate (1e-5) to adapt it to the actual task.

bert_model.trainable = True
# Recompile the model to make the change effective.
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-5),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_ids (InputLayer)          [(None, 128)]        0                                            
__________________________________________________________________________________________________
attention_masks (InputLayer)    [(None, 128)]        0                                            
__________________________________________________________________________________________________
token_type_ids (InputLayer)     [(None, 128)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 128, 768), ( 109482240   input_ids[0][0]                  
                                                                 attention_masks[0][0]            
                                                                 token_type_ids[0][0]             
__________________________________________________________________________________________________
bidirectional (Bidirectional)   (None, 128, 128)     426496      tf_bert_model[0][0]              
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 128)          0           bidirectional[0][0]              
__________________________________________________________________________________________________
global_max_pooling1d (GlobalMax (None, 128)          0           bidirectional[0][0]              
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 256)          0           global_average_pooling1d[0][0]   
                                                                 global_max_pooling1d[0][0]       
__________________________________________________________________________________________________
dropout_37 (Dropout)            (None, 256)          0           concatenate[0][0]                
__________________________________________________________________________________________________
dense (Dense)                   (None, 2)            514         dropout_37[0][0]                 
==================================================================================================
Total params: 109,909,250
Trainable params: 109,909,250
Non-trainable params: 0
__________________________________________________________________________________________________
history = model.fit(
    train_data,
    validation_data=val_data,
    epochs=epochs,
    use_multiprocessing=True,
    workers=-1,
)
Epoch 1/2
8850/8850 [==============================] - 7739s 874ms/step - loss: 0.2853 - accuracy: 0.8741 - val_loss: 0.2712 - val_accuracy: 0.8832
Epoch 2/2
8850/8850 [==============================] - 7737s 874ms/step - loss: 0.2139 - accuracy: 0.9100 - val_loss: 0.2477 - val_accuracy: 0.8973

After waiting 4-5 hours, our model is trained!

Evaluate on Test Dataset

test_data = BertSemanticDataGenerator(
    test_df[["question1", "question2"]].values.astype("str"),
    y_test,
    batch_size=batch_size,
    shuffle=False,
)
model.evaluate(test_data, verbose=1)
1890/1890 [==============================] - 529s 280ms/step - loss: 0.2438 - accuracy: 0.9001
[0.24381373822689056, 0.9000826478004456]

We got ~90% accuracy on the test dataset, which is far better than the ~74% from Naive Bayes.
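Since the classes are imbalanced (~63% non-duplicates), accuracy alone can be flattering; a minimal sketch to also look at precision and recall on the same test set (reuses test_data and test_df from above; the generator drops the last partial batch, so the ground truth is truncated to match):

from sklearn.metrics import classification_report

probs = model.predict(test_data, verbose=1)  # softmax over [not duplicate, duplicate]
y_true = test_df["is_duplicate"].values[:probs.shape[0]]
y_hat = probs.argmax(axis=1)
print(classification_report(y_true, y_hat, target_names=["not duplicate", "duplicate"]))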

def check_similarity(sentence1, sentence2):
  sentence_pairs = np.array([[str(sentence1), str(sentence2)]])
  test_data = BertSemanticDataGenerator(
      sentence_pairs, labels=None, batch_size=1, shuffle=False, include_targets=False,
  )

  proba = model.predict(test_data)[0]
  idx = np.argmax(proba)
  proba = f"{proba[idx]: .2f}"  # confidence of the predicted class (a fraction in [0, 1])
  pred = labels[idx]
  return pred, proba

Try Custom Questions

ind = np.random.randint(0,500) 
#Duplicate Questions
q1 = test_df[test_df["is_duplicate"] == 1].iloc[ind]['question1']
q2 = test_df[test_df["is_duplicate"] == 1].iloc[ind]['question2']
print(q1+"\n"+q2)
check_similarity(q1,q2)
Why when I read an English article do I understand most of the words but I cannot understand the message of the article very well? How can I improve my reading comprehension?
I am very poor in English language and even struggle to understand while reading small article also. How can I improve it without attend class?
(0, ' 0.81')
ind = np.random.randint(0,500)
#Non-Duplicate Questions
q1 = test_df[test_df["is_duplicate"] == 0].iloc[ind]['question1']
q2 = test_df[test_df["is_duplicate"] == 0].iloc[ind]['question2']
print(q1+"\n"+q2)
check_similarity(q1,q2)
How many pullups can an average person do?
How many photos a day does an average person take with their phone.?
(1, ' 1.00')