
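I am trying to generate an .arff file from a CSV data file I have. I am completely new to Weka and only started using it a day ago. As a first exercise, I am attempting a simple Twitter sentiment analysis. I have already generated the training data as CSV. The CSV file looks like this: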

  tweet,affinScore,polarity
 ATAUTHORcfoblog is giving away a $25 Amex gift card (enter to win over $600 in prizes!) http://t.co/JD8EP14c ,4,4
"American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite-retailer platform, no? ",0,1
African-American Demos Express Ethnic Identity Differently http://t.co/gInv4bKj via  ATAUTHORmediapost ,0,3
Google ???????? Visa ? American Express  http://t.co/eEZTSiHY ,0,4
Secrets to Success from Small-Business Owners : Lifestyle :: American Express OPEN Forum http://t.co/b85F8JX0 via  ATAUTHOROpenForum ,2,1
RT  ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1
Winning Surveys $1500 american express Huggies Sweeps http://t.co/WoaTFowp ,4,1
I root for Square mostly because a small business that takes Square is also one that takes American Express. ,0,1
I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Uh oh... RT  ATAUTHORBlackArrowBella: I dont know how bitch be acting American Express but they cards be saying DEBIT ON IT HAVE A ?? PLEASE!!! ,-5,2
Just got another credit card. A Blue Sky card with American Express. Its gonna help pay for the honeymoon!  ATAUTHORAmericanExpress ,-1,1
Follow  ATAUTHORShaveMagazine and ReTweet this msg to be entered to #Win an American Express Gift card. Winners contacted bi-weekly by direct msg! ,2,4
American Express Gold zakelijk aanvragen: http://t.co/xheZwmbt ,0,3
RT  ATAUTHORhunterwalk: American Express has always been my dark horse acquirer of  ATAUTHORFoursquare. Bundle in Square-like payments & its a lite ... ,0,1

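Here the first attribute is the actual tweet, the second is the AFINN score (column affinScore), and the third is the actual class label (1 = positive, 2 = negative, 3 = neutral, 4 = spam).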

I am now trying to generate the .arff file from it with the following code:

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

import java.io.File;

public class CSV2Arff {
  /**
   * takes 2 arguments:
   * - CSV input file
   * - ARFF output file
   */
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
      System.exit(1);
    }

    // load CSV
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(args[0]));
    Instances data = loader.getDataSet();

    // save ARFF
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    saver.setFile(new File(args[1]));
    saver.setDestination(new File(args[1]));
    saver.writeBatch();
  }
}

This generates an .arff file that looks something like this:

   @relation file

@attribute tweet {_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._

@data
_ATAUTHORcfoblog_is_giving_away_a_$25_Amex_gift_card_(enter_to_win_over_$600_in_prizes!)_http://t.co/JD8EP14c_,4,4
'American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite-retailer_platform,_no?_',0,1
African-American_Demos_Express_Ethnic_Identity_Differently_http://t.co/gInv4bKj_via__ATAUTHORmediapost_,0,3
Google_????????_Visa_?_American_Express__http://t.co/eEZTSiHY_,0,4
Secrets_to_Success_from_Small-Business_Owners_:_Lifestyle_::_American_Express_OPEN_Forum_http://t.co/b85F8JX0_via__ATAUTHOROpenForum_,2,1
RT__ATAUTHORhunterwalk:_American_Express_has_always_been_my_dark_horse_acquirer_of__ATAUTHORFoursquare._Bundle_in_Square-like_payments_&_its_a_lite_..._,0,1

I am new to Weka, but from what I have read, I suspect this ARFF file is not formatted correctly. Can anyone comment on this?

Also, if it is wrong, could someone point out exactly where I went wrong?

2 Answers

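Make sure the tweet attribute is typed as an arbitrary string rather than a categorical (nominal) attribute, which appears to be the default. The nominal type does not scale well, because it puts a copy of every tweet into the attribute's type definition.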
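To make the difference concrete, here is a minimal sketch of what the ARFF could look like with tweet declared as a string attribute. It deliberately does not use Weka: it writes the ARFF text by hand in plain Java, so the class name ArffStringSketch and its methods are hypothetical helpers, not part of any library. Declaring polarity as a nominal attribute with the values 1 to 4 is an assumption based on the class labels described in the question. Weka also ships a NominalToString filter (weka.filters.unsupervised.attribute.NominalToString) that should be able to convert the attribute after loading; that route is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Weka): writes ARFF text by hand.
class ArffStringSketch {

    /** Wraps a value in single quotes, escaping backslashes and single quotes. */
    static String quote(String value) {
        return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'";
    }

    /** Builds an ARFF document from rows of {tweet, affinScore, polarity}. */
    static String toArff(String relation, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        // The tweet is free text, so it is declared as string, not nominal.
        sb.append("@attribute tweet string\n");
        sb.append("@attribute affinScore numeric\n");
        // Assumption: polarity is the class label with values 1..4.
        sb.append("@attribute polarity {1,2,3,4}\n\n");
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',')
              .append(row[1]).append(',')
              .append(row[2]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {
            "Winning Surveys $1500 american express Huggies Sweeps http://t.co/WoaTFowp ", "4", "1"});
        rows.add(new String[] {
            "American Express Gold zakelijk aanvragen: http://t.co/xheZwmbt ", "0", "3"});
        System.out.print(toArff("tweets", rows));
    }
}
```

With this layout the header stays three lines long no matter how many tweets the data section contains, which is the scaling problem the nominal declaration runs into.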

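Note that for actual analysis of the tweet content, you will probably need to preprocess the tweets further. You will likely want a sparse vector representation of the text rather than one long string.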
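As a rough illustration of what a sparse vector representation means, the sketch below builds a bag-of-words vocabulary and stores, per tweet, only the indices of the words that occur together with their counts. It is plain Java and independent of Weka; the class SparseBagOfWords, its methods, and the simplistic tokenizer (lowercase, split on anything that is not an ASCII letter or digit) are assumptions made for illustration. Within Weka, the StringToWordVector filter (weka.filters.unsupervised.attribute.StringToWordVector) is the usual tool for this kind of conversion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper (not part of Weka): bag-of-words with sparse vectors.
class SparseBagOfWords {

    /** Lowercases the text and splits it on any run of non-alphanumeric characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Assigns each distinct token an index, in order of first appearance. */
    static Map<String, Integer> buildVocabulary(List<String> docs) {
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String doc : docs) {
            for (String token : tokenize(doc)) {
                vocab.putIfAbsent(token, vocab.size());
            }
        }
        return vocab;
    }

    /** Returns term index -> count, containing only the terms present in the document. */
    static Map<Integer, Integer> vectorize(String doc, Map<String, Integer> vocab) {
        Map<Integer, Integer> vector = new TreeMap<>();
        for (String token : tokenize(doc)) {
            Integer index = vocab.get(token);
            if (index != null) {
                vector.merge(index, 1, Integer::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("American Express gift card", "gift card gift");
        Map<String, Integer> vocab = buildVocabulary(docs);
        System.out.println(vocab);                          // {american=0, express=1, gift=2, card=3}
        System.out.println(vectorize(docs.get(1), vocab));  // {2=2, 3=1}
    }
}
```

Each tweet touches only a handful of vocabulary entries, so storing index and count pairs keeps the representation small even when the vocabulary grows to thousands of words.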

Answered 2012-11-14T07:24:42.800

If you use the UI mentioned earlier, you can load the file into Weka directly.

If you just want to generate an ARFF file from a CSV file, you can do the following. This is taken from the CSV2Arff tool that ships as part of Weka.

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;

public class CSV2Arff {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.println("\nUsage: CSV2Arff <input.csv> <output.arff>\n");
      System.exit(1);
    }

    // load CSV
    CSVLoader loader = new CSVLoader();
    loader.setSource(new File(args[0]));
    Instances data = loader.getDataSet();

    // save ARFF
    ArffSaver saver = new ArffSaver();
    saver.setInstances(data);
    saver.setFile(new File(args[1]));
    saver.setDestination(new File(args[1]));
    saver.writeBatch();
  }
}
Answered 2012-11-15T14:34:39.873