K-Fold Cross-Validation Explained

 

K-fold is a commonly used cross-validation method: the dataset is split into K groups, and in turn K-1 groups are used as the training set while the remaining group serves as the test set, which makes it possible to estimate how well the model generalizes. The method is easy to understand and implement, and its results are more informative than those of a regular validation split. K is usually set to 5 or 10, but there is no fixed rule. The main steps are: shuffle the dataset randomly, split it into K groups, in turn train the model on K-1 groups and evaluate it on the remaining group, keep the evaluation score and discard the model, and finally summarize the model's performance with the mean of all evaluation scores. The advantages of K-fold are that it uses all of the data, yields more evaluation results, and improves the precision of the evaluation, which is especially useful when the dataset is small.

The principle behind K-fold is that training and evaluating the model several times reduces the randomness introduced by a single split, giving a more accurate estimate of the model's performance.

K-fold is a cross-validation method used to estimate the skill of a machine learning model on unseen data. It is commonly used to validate a model because it is easy to understand, easy to implement, and its results are more informative than those of regular validation methods.

Cross-validation is a resampling procedure used to validate machine learning models on a limited data sample. The procedure has a single parameter, K, which refers to the number of groups a given data sample is split into; that is why it is called K-fold.

The choice of K is usually 5 or 10, but there is no formal rule. As K gets larger, the resampling subsets get smaller. K also defines how often your machine learning model is trained. Most of the time we split our data into train/validation sets at 80%-20%, 90%-10%, or 70%-30% and train our model once. In cross-validation, we split our data K times and train one model per split. Be aware that this results in longer training processes.
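As a quick illustration of how K trades off fold size against the number of training runs, here is a small sketch using scikit-learn's KFold on a made-up 100-sample array (the dataset and sizes are only for illustration):

    import numpy as np
    from sklearn.model_selection import KFold

    data = np.arange(100)  # stand-in dataset with 100 samples

    for k in (5, 10):
        # each split yields one held-out fold; larger K means more
        # training runs and smaller test folds
        fold_sizes = [len(test) for _, test in KFold(n_splits=k).split(data)]
        print(f"K={k}: {len(fold_sizes)} training runs, test fold sizes {fold_sizes}")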

K-Fold steps:

    1. Shuffle the dataset.
    2. Split the dataset into K groups.
    3. For each unique group g:
       - Take g as the test dataset.
       - Take the remaining groups as the training dataset.
       - Fit a model on the training set and evaluate it on the test set.
       - Retain the evaluation score and discard the model.
    4. Summarize the skill of the model using the sample of model evaluation scores.

The results of a K-fold cross-validation run are often summarized with the mean of the model scores.
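The steps above can also be written down directly, without any library. The sketch below assumes hypothetical fit_model and evaluate callables standing in for whatever training and metric code a project uses:

    import random

    def k_fold_scores(dataset, k, fit_model, evaluate):
        """Generic K-fold loop; fit_model and evaluate are placeholder
        callables, not part of any particular library."""
        data = list(dataset)
        random.shuffle(data)                      # 1. shuffle the dataset
        folds = [data[i::k] for i in range(k)]    # 2. split into K groups
        scores = []
        for g in range(k):                        # 3. for each unique group g
            test = folds[g]                       #    use group g as the test set
            train = [x for i, fold in enumerate(folds) if i != g for x in fold]
            model = fit_model(train)              #    fit on the remaining groups
            scores.append(evaluate(model, test))  #    retain the score, discard the model
        return scores                             # 4. summarize, e.g. sum(scores) / k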


Scikit-Learn Example

The example is a simple implementation with scikit-learn and a NumPy array of scalar values.

    import numpy as np
    from sklearn.model_selection import KFold

    # data sample
    data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

    # prepare cross validation
    kfold = KFold(n_splits=3, shuffle=True, random_state=1)

    # enumerate splits
    for train, test in kfold.split(data):
        print('train: %s, test: %s' % (data[train], data[test]))

    # >>> Result
    # train: [0.1 0.4 0.5 0.6], test: [0.2 0.3]
    # train: [0.2 0.3 0.4 0.6], test: [0.1 0.5]
    # train: [0.1 0.2 0.3 0.5], test: [0.4 0.6]
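If you only need the per-fold scores of a scikit-learn estimator, cross_val_score accepts the same KFold object and runs the loop for you. A minimal sketch with a made-up feature matrix and labels (not part of the original example):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    # made-up features and labels, only to show the API
    X = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]])
    y = np.array([0, 0, 0, 1, 1, 1])

    kfold = KFold(n_splits=3, shuffle=True, random_state=1)
    scores = cross_val_score(LogisticRegression(), X, y, cv=kfold)
    print(scores, scores.mean())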

Simpletransformers Example (BERT Text-Classification)

The example is an implementation of BERT text classification with the simpletransformers library and Scikit-Learn.

    from simpletransformers.classification import ClassificationModel
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score
    import pandas as pd

    # Dataset (placeholder rows; the full dataset used for the results below is not included)
    dataset = [["Example sentence belonging to class 1", 1],
               ["Example sentence belonging to class 0", 0],
               ["Example eval sentence belonging to class 1", 1],
               ["Example eval sentence belonging to class 0", 0]]
    train_data = pd.DataFrame(dataset)

    # prepare cross validation
    n = 5
    seed = 42  # any fixed seed keeps the splits reproducible
    kf = KFold(n_splits=n, random_state=seed, shuffle=True)

    results = []

    for train_index, val_index in kf.split(train_data):
        # splitting Dataframe (dataset not included)
        train_df = train_data.iloc[train_index]
        val_df = train_data.iloc[val_index]

        # Defining Model
        model = ClassificationModel('bert', 'bert-base-uncased')

        # train the model
        model.train_model(train_df)

        # validate the model
        result, model_outputs, wrong_predictions = model.eval_model(val_df, acc=accuracy_score)
        print(result['acc'])

        # append model score
        results.append(result['acc'])

    print("results", results)
    print(f"Mean-Precision: {sum(results) / len(results)}")

    # >>> Result
    # 0.8535784635587655
    # 0.8509520682862771
    # 0.855548260013132
    # 0.8272010512483574
    # 0.8212877792378449
    # results [0.8535784635587655, 0.8509520682862771, 0.855548260013132,
    #          0.8272010512483574, 0.8212877792378449]
    # Mean-Precision: 0.8407520682862771
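One possible refinement, not part of the original example: for classification with imbalanced labels, scikit-learn's StratifiedKFold keeps the class ratio roughly constant in every fold. Assuming the labels live in column 1 of the DataFrame, as above, only the split call changes:

    from sklearn.model_selection import StratifiedKFold

    # stratify on the label column (column 1 of the DataFrame above)
    skf = StratifiedKFold(n_splits=n, shuffle=True, random_state=seed)
    for train_index, val_index in skf.split(train_data, train_data[1]):
        train_df = train_data.iloc[train_index]
        val_df = train_data.iloc[val_index]
        # ... same train/eval loop as above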

Benefits of K-Fold Cross-Validation

Using all data: With K-fold cross-validation we use the complete dataset, which is helpful if we have a small dataset, because we split and train the model K times to see its performance instead of setting aside X% of the data purely as a validation set.

Getting more metrics: Most of the time you have a single result per metric, but with K-fold you get K results for each metric and can take a deeper look into your model's performance.

Achieving higher precision: By validating your model against multiple "validation sets" we get a higher level of reliability. Let's imagine the following example: we have 3 speakers and 1500 recordings (500 for each speaker). If we do a simple train/validation split, the result could be very different depending on the split.
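A small sketch of that variability, using a synthetic dataset as a stand-in for the recordings (the data and model here are only illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    # synthetic stand-in for the 1500 recordings
    X, y = make_classification(n_samples=1500, n_features=20, random_state=0)
    clf = LogisticRegression(max_iter=1000)

    # the same model scored on five different single 80/20 splits
    for seed in range(5):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
        print("single split %d: %.3f" % (seed, clf.fit(X_tr, y_tr).score(X_val, y_val)))

    # 5-fold cross-validation summarizes performance over all folds instead
    scores = cross_val_score(clf, X, y, cv=5)
    print("5-fold mean: %.3f (std %.3f)" % (scores.mean(), scores.std()))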
