Building a Non-English BERT Text Classification Model

 

This article shows how to use the Simple Transformers library to build a non-English BERT text classification model and tackle the problem of classifying non-English text data. Using German as the example, a distilbert-base-german-cased model is trained on the Germeval 2019 dataset to detect and classify abusive language across four classes: PROFANITY, INSULT, ABUSE, and OTHER. The final F1 score of 0.6895 is below the best result at the time (0.7361), but reaches the top 20 without any hyperparameter tuning. The tutorial covers the full workflow of installing the library, selecting a pre-trained model, loading the dataset, training and evaluation, saving the model, and prediction, and notes that performance can be improved later through hyperparameter tuning.

📚 The Simple Transformers library is built on HuggingFace Transformers, supports many pre-trained models, and lets you fine-tune a Transformer model in a few lines of code. This article uses the distilbert-base-german-cased model, which has 40% fewer parameters than bert-base-uncased, runs 60% faster, and retains over 95% of the original performance.

🗣️ The walkthrough uses German as the example language with the Germeval 2019 dataset, split into 6,309 training and 702 test samples. The data is downloaded with wget and processed into a pandas DataFrame with four classes (INSULT, ABUSE, PROFANITY, OTHER), and a lambda function converts the class labels into indices.

🚀 Training instantiates a ClassificationModel with num_labels=4 and train_args (e.g. num_train_epochs=4); calling train_model() completes training, with model files written to the default output_dir (outputs/) and checkpoints saved automatically every 2000 steps.

📊 Evaluation uses f1_multiclass() to compute the F1 score together with accuracy_score. The final F1 score of 0.6895 is below the best result, but the untuned baseline model already lands in the Germeval 2019 top 20, demonstrating that the approach works.

💾 Saving and loading are handled by the pack_model() and unpack_model() helper functions, which package the output directory as a tar.gz file for deployment. Loading requires the model path and the same initialization arguments, and prediction via model.predict() correctly classifies the test sentences as OTHER and INSULT.

Currently, 7.5 billion people live in around 200 nations around the world. Only 1.2 billion of them are native English speakers. This leads to a lot of unstructured non-English textual data.

Most of the tutorials and blog posts demonstrate how to build text classification, sentiment analysis, question-answering, or text generation models with BERT-based architectures in English. To fill this gap, I am going to show you how to build a non-English multi-class text classification model.

Since you opened this article, it's safe to assume that you have heard of BERT. If you haven't, or if you'd like a refresher, I recommend reading this paper.

In deep learning, there are currently two options for how to build language models: you can build either monolingual models or multilingual models.

"multilingual, or not multilingual, that is the question" - as Shakespeare would have said

Multilingual models describe machine learning models that can understand different languages. An example of a multilingual model is mBERT from Google research. This model supports and understands 104 languages. Monolingual models, as the name suggests, can understand one language.

Multilingual models are already achieving good results on certain tasks. But these models are bigger, need more data, and also more time to train. These properties lead to higher costs due to the larger amount of data and time resources needed.

Due to this fact, I am going to show you how to train a monolingual non-English BERT-based multi-class text classification model. Wow, that was a long sentence!


Tutorial

We are going to use Simple Transformers - an NLP library based on the Transformers library by HuggingFace. Simple Transformers allows us to fine-tune Transformer models in a few lines of code.

As the dataset, we are going to use Germeval 2019, which consists of German tweets. We are going to detect and classify abusive language tweets. These tweets are categorized into 4 classes: PROFANITY, INSULT, ABUSE, and OTHER. The highest score achieved on this dataset is 0.7361.

We are going to:

    - install the Simple Transformers library
    - select a pre-trained monolingual model
    - load the dataset
    - train/fine-tune our model
    - evaluate the results of training
    - save the trained model
    - load the model and predict a real example

I am using Google Colab with a GPU runtime for this tutorial. If you are not sure how to use a GPU runtime, take a look here.
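As a quick sanity check that is not part of the original tutorial, you can confirm that the GPU runtime is active. This is a minimal sketch, assuming PyTorch is available in the runtime (it is preinstalled on Colab):

import torch

# True if a CUDA-capable GPU is visible to PyTorch
print(torch.cuda.is_available())

# name of the assigned GPU (e.g. a Tesla T4 on Colab), or a fallback message
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU found")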


Install Simple Transformers library

First, we install simpletransformers with pip. If you are not using Google Colab, you can check out the installation guide here.

# install simpletransformers
!pip install simpletransformers

# check installed version
!pip freeze | grep simpletransformers
# simpletransformers==0.28.2

Select a pre-trained monolingual model

Next, we select the pre-trained model. As mentioned above, the Simple Transformers library is based on the Transformers library from HuggingFace. This enables us to use every pre-trained model provided in the Transformers library and all community-uploaded models. For a list that includes all community-uploaded models, I refer to https://huggingface.co/models.

We are going to use the distilbert-base-german-cased model, a smaller, faster, cheaper version of BERT. It uses 40% fewer parameters than bert-base-uncased and runs 60% faster while still preserving over 95% of BERT's performance.
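If you want to double-check that the checkpoint is available before training, a minimal sketch (assuming the transformers library is installed alongside simpletransformers) is to load its tokenizer directly from the Hugging Face Hub:

from transformers import AutoTokenizer

# downloads the vocabulary of distilbert-base-german-cased from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-german-cased")

# tokenize a short German sentence to see the WordPiece output
print(tokenizer.tokenize("Das ist ein Beispielsatz."))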


Load the dataset

The dataset is stored in two text files we can retrieve from the competition page. One option to download them is using 2 simple wget CLI commands.

!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/08/germeval2019GoldLabelsSubtask1_2.txt
!wget https://projects.fzai.h-da.de/iggsa/wp-content/uploads/2019/09/germeval2019.training_subtask1_2_korrigiert.txt

Afterward, we use some pandas magic to create a dataframe.

import pandas as pd

class_list = ['INSULT', 'ABUSE', 'PROFANITY', 'OTHER']

df1 = pd.read_csv('germeval2019GoldLabelsSubtask1_2.txt',
                  sep='\t', lineterminator='\n', encoding='utf8',
                  names=["tweet", "task1", "task2"])
df2 = pd.read_csv('germeval2019.training_subtask1_2_korrigiert.txt',
                  sep='\t', lineterminator='\n', encoding='utf8',
                  names=["tweet", "task1", "task2"])

df = pd.concat([df1, df2])
df['task2'] = df['task2'].str.replace('\r', "")
df['pred_class'] = df.apply(lambda x: class_list.index(x['task2']), axis=1)

df = df[['tweet', 'pred_class']]

print(df.shape)
df.head()

Since we don't have a test dataset, we split our dataset into train_df and test_df. We use 90% of the data for training (train_df) and 10% for testing (test_df).

from sklearn.model_selection import train_test_split

train_df, test_df = train_test_split(df, test_size=0.10)

print('train shape: ', train_df.shape)
print('test shape: ', test_df.shape)

# train shape:  (6309, 2)
# test shape:  (702, 2)
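One optional tweak that is not part of the original tutorial: because the four classes are imbalanced, you could pass the label column to stratify so that the train and test splits keep roughly the same class distribution:

from sklearn.model_selection import train_test_split

# keep the class proportions identical in both splits
train_df, test_df = train_test_split(df, test_size=0.10, stratify=df['pred_class'])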

Load pre-trained model

The next step is to load the pre-trained model. We do this by creating a ClassificationModel instance called model. This instance takes the following parameters:

    - the architecture (in our case "bert")
    - the pre-trained model ("distilbert-base-german-cased")
    - the number of class labels (4)
    - and our hyperparameters for training (train_args)

You can configure the hyperparameters within a wide range of possibilities. For a detailed description of each attribute, please refer to the documentation.

from simpletransformers.classification import ClassificationModel

# define hyperparameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel
model = ClassificationModel(
    "bert", "distilbert-base-german-cased",
    num_labels=4,
    args=train_args
)

Train/fine-tune our model

To train our model we only need to run model.train_model() and specify which dataset to train on.

model.train_model(train_df)

Evaluate the results of training

After we trained our model successfully, we can evaluate it. Therefore we create a simple helper function f1_multiclass(), which is used to calculate the f1_score. The f1_score is a measure of model accuracy. More on that here.

from sklearn.metrics import f1_score, accuracy_score

def f1_multiclass(labels, preds):
    return f1_score(labels, preds, average='micro')

result, model_outputs, wrong_predictions = model.eval_model(test_df, f1=f1_multiclass, acc=accuracy_score)

# {'acc': 0.6894586894586895,
#  'eval_loss': 0.8673831869594075,
#  'f1': 0.6894586894586895,
#  'mcc': 0.25262380289641617}
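A side note on average='micro': for a single-label multi-class problem, micro-averaged F1 is identical to plain accuracy, which is why 'acc' and 'f1' match in the output above. A tiny toy example (the values here are made up purely for illustration) shows this:

from sklearn.metrics import f1_score, accuracy_score

# toy labels: 3 out of 4 predictions are correct
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]

print(accuracy_score(y_true, y_pred))             # 0.75
print(f1_score(y_true, y_pred, average='micro'))  # 0.75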

We achieved an f1_score of 0.6895. Initially, this seems rather low, but keep in mind: the highest submission at Germeval 2019 was 0.7361. We would have achieved a top 20 rank without tuning the hyperparameters. This is pretty impressive!

In a future post, I am going to show you how to achieve a higher f1_score by tuning the hyperparameters.
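Until then, the place to experiment is the train_args dictionary. The sketch below shows the kind of standard Simple Transformers training arguments one might vary; the particular numbers are illustrative guesses, not tuned results:

# illustrative hyperparameters to experiment with, not tuned values
train_args = {
    "reprocess_input_data": True,
    "fp16": False,
    "num_train_epochs": 4,
    "learning_rate": 4e-5,     # try a few values around the default
    "train_batch_size": 32,
    "max_seq_length": 128,     # tweets are short, so a small limit may suffice
}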


Save the trained model

Simple Transformers saves the model automatically every 2000 steps and at the end of the training process. The default directory is outputs/. But the output_dir is a hyperparameter and can be overwritten. I created a helper function pack_model(), which we use to pack all required model files into a tar.gz file for deployment.

import os
import tarfile

def pack_model(model_path='', file_name=''):
  files = [files for root, dirs, files in os.walk(model_path)][0]
  with tarfile.open(file_name + '.tar.gz', 'w:gz') as f:
    for file in files:
      f.add(f'{model_path}/{file}')

# run the function
pack_model('output_path', 'model_name')

Load the model and predict a real example

As a final step, we load the model and predict a real example. Since we packed our files a step earlier with pack_model(), we have to unpack them first. Therefore I wrote another helper function unpack_model() to unpack our model files.

import os
import tarfile

def unpack_model(model_name=''):
  tar = tarfile.open(f"{model_name}.tar.gz", "r:gz")
  tar.extractall()
  tar.close()

unpack_model('model_name')

To load a saved model, we only need to provide the path to our saved files and initialize it the same way as we did in the training step. Note: you will need to specify the correct (usually the same as used in training) args when loading the model.

from simpletransformers.classification import ClassificationModel

# define hyperparameters
train_args = {"reprocess_input_data": True,
              "fp16": False,
              "num_train_epochs": 4}

# Create a ClassificationModel with our trained model
model = ClassificationModel(
    "bert", 'path_to_model/',
    num_labels=4,
    args=train_args
)

After initializing it, we can use the model.predict() function to classify a given input. In this example, we take a tweet from the Germeval 2018 dataset.

class_list = ['INSULT', 'ABUSE', 'PROFANITY', 'OTHER']

test_tweet1 = "Meine Mutter hat mir erzählt, dass mein Vater einen Wahlkreiskandidaten nicht gewählt hat, weil der gegen die Homo-Ehe ist"

predictions, raw_outputs = model.predict([test_tweet1])
print(class_list[predictions[0]])
# OTHER

test_tweet2 = "Frau #Böttinger meine Meinung dazu ist sie sollten uns mit ihrem Pferdegebiss nicht weiter belästigen #WDR"

predictions, raw_outputs = model.predict([test_tweet2])
print(class_list[predictions[0]])
# INSULT

Our model predicted the correct classes OTHER and INSULT.


Summary

To conclude, we can say we achieved our goal of creating a non-English BERT-based text classification model.

Our example used the German language, but the approach can easily be transferred to another language. HuggingFace offers a lot of pre-trained models for languages like French, Spanish, Italian, Russian, Chinese, ...
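For example, switching to French would mostly come down to changing the checkpoint and architecture name. This is a sketch under the assumption that the camembert-base checkpoint and the corresponding "camembert" model type in Simple Transformers fit your task; it is not something covered in this tutorial:

from simpletransformers.classification import ClassificationModel

# same setup as before, only the architecture and checkpoint change
model = ClassificationModel(
    "camembert", "camembert-base",
    num_labels=4,
    args=train_args
)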


Thanks for reading. You can find the colab notebook with the complete code here.

If you have any questions, feel free to contact me.
