Building a Non-English BERT Text Classification Model

 

This article shows how to use the Simple Transformers library to build a non-English BERT text classification model and tackle the problem of classifying non-English text data. Using German as the example, a distilbert-base-german-cased model is trained on the Germeval 2019 dataset to detect and classify abusive language across four classes: PROFANITY, INSULT, ABUSE, and OTHER. The final F1 score of 0.6895 is below the best result at the time (0.7361), but reaches the top 20 without any hyperparameter tuning. The tutorial covers the full workflow of installing the library, selecting a pre-trained model, loading the dataset, training and evaluation, saving the model, and prediction, and notes that performance can be improved later through hyperparameter tuning.

📚 The Simple Transformers library is built on HuggingFace Transformers, supports many pre-trained models, and lets you fine-tune a Transformer model in a few lines of code. This article uses the distilbert-base-german-cased model, which has 40% fewer parameters than bert-base-uncased, runs 60% faster, and retains over 95% of the original performance.

🗣️ The walkthrough uses German as the example language with the Germeval 2019 dataset, split into 6,309 training and 702 test samples. The data is downloaded with wget and processed into a pandas DataFrame with four classes (INSULT, ABUSE, PROFANITY, OTHER), and a lambda function converts the class labels into indices.

🚀 Training instantiates a ClassificationModel with num_labels=4 and train_args (e.g. num_train_epochs=4); calling train_model() completes training, with model files written to the default output_dir (outputs/) and checkpoints saved automatically every 2000 steps.

📊 Evaluation uses f1_multiclass() to compute the F1 score together with accuracy_score. The final F1 score of 0.6895 is below the best result, but the untuned baseline model already lands in the Germeval 2019 top 20, demonstrating that the approach works.

💾 Saving and loading are handled by the pack_model() and unpack_model() helper functions, which package the output directory as a tar.gz file for deployment. Loading requires the model path and the same initialization arguments, and prediction via model.predict() correctly classifies the test sentences as OTHER and INSULT.

Currently, 7.5 billion people live in around 200 nations around the world. Only 1.2 billion of them are native English speakers. This leads to a lot of unstructured non-English textual data.

Most of the tutorials and blog posts demonstrate how to build text classification, sentiment analysis, question-answering, or text generation models with BERT-based architectures in English. To fill this gap, I am going to show you how to build a non-English multi-class text classification model.

Since you opened this article, it's safe to assume that you have heard of BERT. If you haven't, or if you'd like a refresher, I recommend reading this paper.

In deep learning, there are currently two options for how to build language models: you can build either monolingual models or multilingual models.

"multilingual, or not multilingual, that is the question" - as Shakespeare would have said

Multilingual models describe machine learning models that can understand different languages. An example of a multilingual model is mBERT from Google research. This model supports and understands 104 languages. Monolingual models, as the name suggests, can understand one language.

Multilingual models are already achieving good results on certain tasks. But these models are bigger, need more data, and also more time to train. These properties lead to higher costs due to the larger amount of data and time resources needed.

Due to this fact, I am going to show you how to train a monolingual non-English BERT-based multi-class text classification model. Wow, that was a long sentence!


Tutorial

We are going to use Simple Transformers - an NLP library based on the Transformers library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.

As the dataset, we are going to use Germeval 2019, which consists of German tweets. We are going to detect and classify abusive language tweets. These tweets are categorized into 4 classes: PROFANITY, INSULT, ABUSE, and OTHER. The highest score achieved on this dataset is 0.7361.

We are going to:

    - install the Simple Transformers library
    - select a pre-trained monolingual model
    - load the dataset
    - train/fine-tune our model
    - evaluate the results of training
    - save the trained model
    - load the model and predict a real example

I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.
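As a quick sanity check that is not part of the original tutorial, you can confirm that the GPU runtime is active. This is a minimal sketch, assuming PyTorch is available in the runtime (it is preinstalled on Colab):

import torch

# True if a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())

# name of the assigned GPU (e.g. a Tesla T4 on Colab), or a fallback message
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU found")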


Install Simple Transformers library

First, we install simpletransformers with pip. If you are not using Google Colab, you can check out the installation guide here.

# install simpletransformers
!pip install simpletransformers

# check installed version
!pip freeze | grep simpletransformers
# simpletransformers==0.28.2

Select a pre-trained monolingual model

Next, we select the pre-trained model. As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the Transformers library and all community-uploaded models. For a list that includes all community-uploaded models, I refer to https://huggingface.co/models.

We are going to use the distilbert-base-german-cased model, a smaller, faster, cheaper version of BERT. It uses 40% fewer parameters than bert-base-uncased and runs 60% faster while still preserving over 95% of BERT's performance.
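If you want to double-check that the checkpoint is available before training, a minimal sketch (assuming the transformers library is installed alongside simpletransformers) is to load its tokenizer directly from the Hugging Face Hub:

from transformers import AutoTokenizer

# downloads the vocabulary of distilbert-base-german-cased from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")

# tokenize a short German sentence to see the WordPiece output
print(tokenizer.tokenize("Das ist ein Beispielsatz."))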


Load the dataset

The dataset is stored in two text files we can retrieve from the competition page. One option to download them is using 2 simple wget CLI commands.

!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt

Afterward, we use some pandas magic to create a dataframe.

import pandas as pd

class_list = ['INSULT', 'ABUSE', 'PROFANITY', 'OTHER']

df1 = pd.read_csv('germeval2019GoldLabelsSubtask1_2.txt',
                  sep='\t', lineterminator='\n', encoding='utf8',
                  names=["tweet", "task1", "task2"])
df2 = pd.read_csv('germeval2019.training_subtask1_2_korrigiert.txt',
                  sep='\t', lineterminator='\n', encoding='utf8',
                  names=["tweet", "task1", "task2"])

df = pd.concat([df1, df2])
df['task2'] = df['task2'].str.replace('\r', "")
df['pred_class'] = df.apply(lambda x: class_list.index(x['task2']), axis=1)

df = df[['tweet', 'pred_class']]

print(df.shape)
df.head()

Since we don't have a test dataset, we split our dataset into train_df and test_df. We use 90% of the data for training (train_df) and 10% for testing (test_df).

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.10)

print('train shape: ', train_df.shape)
print('test shape: ', test_df.shape)

# train shape:  (6309, 2)
# test shape:  (702, 2)
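One optional tweak that is not part of the original tutorial: because the four classes are imbalanced, you could pass the label column to stratify so that the train and test splits keep roughly the same class distribution:

from sklearn.model_selection import train_test_split

# keep the class proportions identical in both splits
train_df, test_df = train_test_split(df, test_size=0.10, stratify=df['pred_class'])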

Load pre-trained model

The next step is to load the pre-trained model. We do this by creating a ClassificationModel instance called model. This instance takes the following parameters:

    - the architecture (in our case "bert")
    - the pre-trained model ("distilbert-base-german-cased")
    - the number of class labels (4)
    - and our hyperparameters for training (train_args)

You can configure the hyperparameters within a wide range of possibilities. For a detailed description of each attribute, please refer to the documentation.

from simpletransformers.classification import ClassificationModel

# define hyperparameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "distilbert-base-german-cased",
    num_labels=4,
    args=train_args
)

Train/fine-tune our model

To train our model we only need to run model.train_model() and specify which dataset to train on.

model.train_model(train_df)

Evaluate the results of training

After we trained our model successfully, we can evaluate it. Therefore we create a simple helper function f1_multiclass(), which is used to calculate the f1_score. The f1_score is a measure of model accuracy. More on that here.

from sklearn.metrics import f1_score, accuracy_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

# {'acc': 0.6894586894586895,
#  'eval_loss': 0.8673831869594075,
#  'f1': 0.6894586894586895,
#  'mcc': 0.25262380289641617}
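A side note on average='micro': for a single-label multi-class problem, micro-averaged F1 is identical to plain accuracy, which is why 'acc' and 'f1' match in the output above. A tiny toy example (the values here are made up purely for illustration) shows this:

from sklearn.metrics import f1_score, accuracy_score

# toy labels: 3 out of 4 predictions are correct
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]

print(accuracy_score(y_true, y_pred))             # 0.75
print(f1_score(y_true, y_pred, average='micro'))  # 0.75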

We achieved an f1_score of 0.6895. Initially, this seems rather low, but keep in mind: the highest submission at Germeval 2019 was 0.7361. We would have achieved a top 20 rank without tuning the hyperparameters. This is pretty impressive!

In a future post, I am going to show you how to achieve a higher f1_score by tuning the hyperparameters.
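Until then, the place to experiment is the train_args dictionary. The sketch below shows the kind of standard Simple Transformers training arguments one might vary; the particular numbers are illustrative guesses, not tuned results:

# illustrative hyperparameters to experiment with, not tuned values
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 4,
    "learning_rate": 4e-5,     # try a few values around the default
    "train_batch_size": 32,
    "max_seq_length": 128,     # tweets are short, so a small limit may suffice
}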


Save the trained model

Simple Transformers saves the model automatically every 2000 steps and at the end of the training process. The default directory is outputs/. But the output_dir is a hyperparameter and can be overwritten. I created a helper function pack_model(), which we use to pack all required model files into a tar.gz file for deployment.

import os
import tarfile

def pack_model(model_path='', file_name=''):
  files = [files for root, dirs, files in os.walk(model_path)][0]
  with tarfile.open(file_name + '.tar.gz', 'w:gz') as f:
    for file in files:
      f.add(f'{model_path}/{file}')

# run the function
pack_model('output_path', 'model_name')

Load the model and predict a real example

As a final step, we load the model and predict a real example. Since we packed our files a step earlier with pack_model(), we have to unpack them first. Therefore I wrote another helper function unpack_model() to unpack our model files.

import os
import tarfile

def unpack_model(model_name=''):
  tar = tarfile.open(f"{model_name}.tar.gz", "r:gz")
  tar.extractall()
  tar.close()

unpack_model('model_name')

To load a saved model, we only need to provide the path to our saved files and initialize it the same way as we did in the training step. Note: you will need to specify the correct (usually the same as used in training) args when loading the model.

from simpletransformers.classification import ClassificationModel

# define hyperparameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel with our trained model
model = ClassificationModel(
    "bert", 'path_to_model/',
    num_labels=4,
    args=train_args
)

After initializing it, we can use the model.predict() function to classify a given input. In this example, we take a tweet from the Germeval 2018 dataset.

class_list = ['INSULT', 'ABUSE', 'PROFANITY', 'OTHER']

test_tweet1 = "Meine Mutter hat mir erzählt, dass mein Vater einen Wahlkreiskandidaten nicht gewählt hat, weil der gegen die Homo-Ehe ist"

predictions, raw_outputs = model.predict([test_tweet1])
print(class_list[predictions[0]])
# OTHER

test_tweet2 = "Frau #Böttinger meine Meinung dazu ist sie sollten uns mit ihrem Pferdegebiss nicht weiter belästigen #WDR"

predictions, raw_outputs = model.predict([test_tweet2])
print(class_list[predictions[0]])
# INSULT

Our model predicted the correct classes OTHER and INSULT.


Summary

To conclude, we can say we achieved our goal of creating a non-English BERT-based text classification model.

Our example used the German language, but the approach can easily be transferred to another language. HuggingFace offers a lot of pre-trained models for languages like French, Spanish, Italian, Russian, Chinese, ...
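For example, switching to French would mostly come down to changing the checkpoint and architecture name. This is a sketch under the assumption that the camembert-base checkpoint and the corresponding "camembert" model type in Simple Transformers fit your task; it is not something covered in this tutorial:

from simpletransformers.classification import ClassificationModel

# same setup as before, only the architecture and checkpoint change
model = ClassificationModel(
    "camembert", "camembert-base",
    num_labels=4,
    args=train_args
)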


Thanks for reading. You can find the colab notebook with the complete code here.

If you have any questions, feel free to contact me.
