c# - ML.NET sentiment analysis warning about format errors and bad values

Question

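I'm having a problem with my ML.NET console application. This is my first time using ML.NET in Visual Studio, so I followed this tutorial from microsoft.com, which is a sentiment analysis using binary classification.

I'm trying to process some test data in the form of TSV files to get a positive or negative sentiment analysis, but while debugging I get a warning that there is 1 format error and 2 bad values.

I decided to ask all the great developers here on Stack whether anyone can help me find a solution.

Here is an image of the debugging session:

Here are the links to my test data: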
wiki-data
wiki-test-data

Finally, here is my code for anyone who wants to reproduce the problem:

There are 2 C# files: SentimentData.cs and Program.cs.

1 - SentimentData.cs:

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.ML.Runtime.Api;

namespace MachineLearningTut
{
 public class SentimentData
 {
    [Column(ordinal: "0")]
    public string SentimentText;
    [Column(ordinal: "1", name: "Label")]
    public float Sentiment;
 }

 public class SentimentPrediction
 {
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
 }
}

2 - Program.cs:

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Threading.Tasks;

namespace MachineLearningTut
{
class Program
{
    const string _dataPath = @".\Data\wikipedia-detox-250-line-data.tsv";
    const string _testDataPath = @".\Data\wikipedia-detox-250-line-test.tsv";
    const string _modelpath = @".\Data\Model.zip";

    static async Task Main(string[] args)
    {
        var model = await TrainAsync();

        Evaluate(model);

        Predict(model);
    }

    public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
    {
        var pipeline = new LearningPipeline();

        pipeline.Add(new TextLoader (_dataPath).CreateFrom<SentimentData>());

        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

        pipeline.Add(new FastForestBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

        PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();

        await model.WriteAsync(path: _modelpath);

        return model;
    }

    public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        var testData = new TextLoader(_testDataPath).CreateFrom<SentimentData>();

        var evaluator = new BinaryClassificationEvaluator();

        BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);

        Console.WriteLine();
        Console.WriteLine("PredictionModel quality metrics evaluation");
        Console.WriteLine("-------------------------------------");
        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"Auc: {metrics.Auc:P2}");
        Console.WriteLine($"F1Score: {metrics.F1Score:P2}");

    }

    public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        IEnumerable<SentimentData> sentiments = new[]
        {
            new SentimentData
            {
                SentimentText = "Please refrain from adding nonsense to Wikipedia."
            },

            new SentimentData
            {
                SentimentText = "He is the best, and the article should say that."
            }
        };

        IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);

        Console.WriteLine();
        Console.WriteLine("Sentiment Predictions");
        Console.WriteLine("---------------------");

        var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));

        foreach (var item in sentimentsAndPredictions)
        {
            Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
        }
        Console.WriteLine();
    }
}

}

If anyone wants to see the solution's code or more details, ask me in chat and I'll send it. Thanks in advance!

score 1 · Accepted Answer

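I think I've got this working for you. There are a couple of things to update:

First, I think your SentimentData properties are swapped relative to the columns in the data: the label should be column 0 and the text column 1. Try changing the class to: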

[Column(ordinal: "0", name: "Label")]
public float Sentiment;

[Column(ordinal: "1")]
public string SentimentText;

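Second, use the useHeader parameter in the TextLoader.CreateFrom method so the header row is not parsed as data. Don't forget to add it to the other loader, the one used for the validation data, as well.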

pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>(useHeader: true));

With these two updates, I got the output below. The model looks good, with an AUC of 85%!

score 0 · Accepted Answer

Another thing that helps with text datasets is telling the loader that the text may be quoted:

new TextLoader("someFile.txt").CreateFrom<Input>(useHeader: true, allowQuotedStrings: true)

score -1 · Accepted Answer

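Rows 252 and 253 have badly formatted values. Maybe some fields there contain the delimiter. If you post the code or sample data, we can be more precise.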
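
To locate rows like these before training, a small stand-alone check can help. The following is a minimal sketch, not part of ML.NET: the TsvChecker class and its FindProblems method are hypothetical names, and it assumes the layout from the accepted answer (a numeric label in column 0, the text in column 1, separated by tabs). It splits naively on tabs, so it only approximates what TextLoader does, particularly when quoted strings are involved.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical helper (not an ML.NET API): scans a TSV file and lists rows
// that are likely to cause "format error" or "bad value" warnings.
public static class TsvChecker
{
    // Assumes tab-separated rows with a numeric label in the first column.
    public static List<string> FindProblems(string path, bool hasHeader, int expectedColumns = 2)
    {
        var problems = new List<string>();
        int lineNumber = 0;

        foreach (string line in File.ReadLines(path))
        {
            lineNumber++;
            if (hasHeader && lineNumber == 1)
                continue; // skip the header row

            // Naive split: a tab inside a field shows up as an extra column,
            // and a blank line shows up as a single empty field.
            string[] fields = line.Split('\t');
            if (fields.Length != expectedColumns)
            {
                problems.Add($"Line {lineNumber}: expected {expectedColumns} fields, found {fields.Length}");
                continue;
            }

            // The label column has to parse as a float.
            if (!float.TryParse(fields[0], NumberStyles.Float, CultureInfo.InvariantCulture, out _))
                problems.Add($"Line {lineNumber}: label '{fields[0]}' is not a number");
        }

        return problems;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("Usage: TsvChecker <path-to-tsv>");
            return;
        }

        foreach (string problem in FindProblems(args[0], hasHeader: true))
            Console.WriteLine(problem);
    }
}
```

Running it against a training file with hasHeader set to false should flag line 1 if the file starts with a header row, which is the same kind of bad value the useHeader parameter avoids.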
