6

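I'd like to assign a score (positive, negative, or neutral) to short text phrases. Beyond parsing out emoticons and making assumptions based on how they're used, I'm not sure what else to try. Can anyone point me to examples, research papers, articles, etc. that take a more lexical approach to this problem?

I'm thinking that things like adverb usage, misused or repeated punctuation, and spelling/grammar errors could all be decent indicators of the author's mood, almost in a binary sense (good or bad).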

3 Answers

3

This sounds like a fairly clear binary classification task: simplify the problem to positive versus negative, then label as neutral the most entropic decisions, that is, any prediction whose probability mass does not reach a threshold of certainty.
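As a rough illustration of that idea, here is a minimal sketch that turns a binary classifier's positive-class probability into one of three labels. The function names and the `max_entropy` cutoff of 0.9 bits are assumptions made for this example, not values taken from any particular toolkit; the cutoff would need tuning on real data.

```python
import math


def binary_entropy(p_positive: float) -> float:
    """Shannon entropy, in bits, of a two-class distribution.

    Assumes 0.0 <= p_positive <= 1.0.
    """
    if p_positive in (0.0, 1.0):
        return 0.0
    q = 1.0 - p_positive
    return -(p_positive * math.log2(p_positive) + q * math.log2(q))


def label_with_neutral(p_positive: float, max_entropy: float = 0.9) -> str:
    """Map a positive-class probability to positive, negative, or neutral.

    max_entropy is a hypothetical cutoff: predictions more uncertain
    than this are reported as neutral instead of forcing a decision.
    """
    if binary_entropy(p_positive) > max_entropy:
        return "neutral"
    return "positive" if p_positive > 0.5 else "negative"
```

With this cutoff, a probability near 0.5 comes out neutral, while probabilities near 0 or 1 keep their binary label.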

Your biggest hurdle will be getting training data for a stochastic machine learning method. The modeling itself is easy with a readily available maximum entropy implementation such as the Toolkit for Advanced Discriminative Modeling or Mallet; the features you described would just have to be formatted as the inputs these tools expect.
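To make the modeling step concrete, here is a minimal pure-Python sketch of a binary maximum entropy model (equivalent to logistic regression) trained on sparse feature dictionaries. It is not the API or input format of either toolkit named above. The feature names, the toy training data, and the hyperparameters are all hypothetical, and there is no regularization, which a real model would want.

```python
import math
from collections import defaultdict


def predict(weights, features):
    """Return P(positive | features) under a logistic (binary maxent) model."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, epochs=200, learning_rate=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood.

    examples: list of (features, label) pairs, where features maps a
    feature name to a number and label is 1 (positive) or 0 (negative).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, features)
            for name, value in features.items():
                weights[name] += learning_rate * error * value
    return dict(weights)


# Hypothetical hand-made features for four short phrases.
training_data = [
    ({"bias": 1.0, "has_lol": 1.0}, 1),
    ({"bias": 1.0, "has_lol": 1.0, "exclamations": 1.0}, 1),
    ({"bias": 1.0, "has_wtf": 1.0}, 0),
    ({"bias": 1.0, "has_wtf": 1.0, "exclamations": 1.0}, 0),
]
model = train(training_data)
```

Calling `predict(model, {"bias": 1.0, "has_lol": 1.0})` then gives the model's probability that such a phrase is positive.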

In order to get training data, you can either do some kind of paid crowdsourcing like Amazon's Mechanical Turk or just do it yourself, maybe with the help of a friend. You'll need a lot of data for this. You can improve the predictive strength of your model in light of a dearth of data with approaches like active learning, ensembling, or boosting, but it's important to test these against real-world data as best as you can and pick what works best in a practical application.
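Of the approaches mentioned, active learning is perhaps the easiest to sketch. One common variant, uncertainty sampling, sends for labeling the examples the current model is least sure about. The helper below is a minimal sketch of that variant only; `most_uncertain`, `predict_positive`, and the toy scores are names and values invented for this example.

```python
def most_uncertain(unlabeled, predict_positive, batch_size=10):
    """Return up to batch_size texts whose predicted probability is closest to 0.5.

    unlabeled: iterable of texts that have no label yet.
    predict_positive: callable mapping a text to P(positive), for
    example the current model's prediction function.
    """
    ranked = sorted(unlabeled, key=lambda text: abs(predict_positive(text) - 0.5))
    return ranked[:batch_size]


# Hypothetical scores standing in for a real model's predictions.
toy_scores = {"great!!": 0.95, "meh...": 0.52, "wtf": 0.10, "ok I guess": 0.45}
to_label = most_uncertain(list(toy_scores), toy_scores.get, batch_size=2)
```

Here `to_label` holds the two phrases scored nearest 0.5, which are the ones a human annotator would be asked to label next before retraining.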

If you're looking for papers on this, search for the term 'sentiment analysis' in Google Scholar. The Association for Computational Linguistics also publishes many free and useful papers from conferences and journals that address the problem from both linguistic and algorithmic standpoints, so its archives are worth browsing too. Good luck!

answered 2009-06-15T15:59:07.020
2

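Well, latent semantic analysis (there is also a paper) seems to be the closest well-established research area to what you're describing. It's less "value-oriented" and focuses more on larger documents, but it may still be relevant to your problem.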

answered 2009-06-15T15:56:15.937
0

This sounds like a really interesting idea; I'd love to see what it produces.

I'd say punctuation is one indicator you could use...

  • ? - a question
  • !?!? (or some variation) - disbelief
  • Phrases using words such as "stupid", "idiot", etc. - anger
  • ... - hesitation, sarcasm

You could also try using common acronyms, such as...

  • LOL - laughter (positive)
  • WTF, OMG - disbelief, shock
  • IMO - thinking, explaining
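The punctuation and acronym cues listed above could be pulled out of a phrase with a few regular expressions. This is only a minimal sketch: the feature names, the `ACRONYM_CUES` table, and the decision to count occurrences are assumptions made for illustration, and the word-based anger cue is left out.

```python
import re
from collections import Counter

# Hypothetical mapping from acronym to the cue it is assumed to signal.
ACRONYM_CUES = {
    "lol": "laughter",
    "wtf": "disbelief",
    "omg": "disbelief",
    "imo": "thinking",
}


def extract_cues(text: str) -> dict:
    """Count simple punctuation and acronym cues in a short phrase."""
    cues = Counter()
    # Runs of ! and ?: mixed runs such as "!?!?" suggest disbelief,
    # runs made only of "?" are treated as a plain question.
    for run in re.findall(r"[!?]+", text):
        if "!" in run and "?" in run:
            cues["disbelief_punctuation"] += 1
        elif "?" in run:
            cues["question"] += 1
    # Three or more dots: hesitation or sarcasm.
    if re.search(r"\.{3,}", text):
        cues["ellipsis"] += 1
    # Case-insensitive acronym lookup on alphabetic tokens.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in ACRONYM_CUES:
            cues["acronym_" + ACRONYM_CUES[word]] += 1
    return dict(cues)
```

The resulting dictionary can serve as a feature vector for whatever classifier is used; whether each cue actually correlates with sentiment would have to be checked against labeled data.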

This is obviously a fairly complicated thing you're trying to do, but it sounds fun.

answered 2009-06-15T15:55:01.937