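I am trying to use the Google Cloud Natural Language API to classify/categorize tweets so that I can filter out tweets that are not relevant to my audience (weather related). I understand that it must be tricky for an AI solution to classify a small amount of text, but I would expect it to at least make a guess for text like this:

Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window #arwx #okwx

I have tested several tweets, but only very few get a classification; the rest return no result (or "No categories found. Try a longer text input." if I try it through the GUI).

Is it pointless to hope for this to work? Or is it possible to lower the threshold for classification? An "educated guess" from the NLP solution would be better than no filter at all. Is there an alternative solution (other than training my own NLP model)?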

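Edit: To clarify:

In the end, I am using the Google Cloud Platform Natural Language API to classify tweets. To test it, I am using the GUI (linked above). I can see that very few of the tweets I test (in the GUI) get a classification from GCP NLP, i.e. the categories are empty.

The ideal state I want is for GCP NLP to provide a category guess for the tweet text instead of an empty result. I assume the NLP model drops any result with a confidence below X%. It would be interesting to know whether that threshold can be configured.

I assume tweets must have been classified before. Is there any other way to solve this?

Edit 2: Code for classifying tweets: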

async function classifyTweet(tweetText) {
   const language = require('@google-cloud/language');
   // projectId and keyFilename are assumed to be defined elsewhere in the module.
   const client = new language.LanguageServiceClient({projectId, keyFilename});
   //const tweetText = "Some light snow dusted the ground this morning, adding to the intense snow fall of yesterday. Here at my Warwick station the numbers are in, New Snow 19.5cm and total depth 26.6cm. A very good snow event. Photos to be posted. #ONStorm #CANWarnON4464 #CoCoRaHSON525"
   const document = {
      content: tweetText,
      type: 'PLAIN_TEXT',
   };   
   const [classification] = await client.classifyText({document});
   
   console.log('Categories:');
   classification.categories.forEach(category => {
     console.log(`Name: ${category.name}, Confidence: ${category.confidence}`);
   });
   
   return classification.categories
}
1 Answer

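I have dug into the current state of Cloud Natural Language, and my answer to your main question is that, in the current state of Natural Language, classifying the text is not possible. A workaround, however, would be to base your categories on the output obtained from analyzing the input text.

Considering that we are not using a custom model for this and are only using the options Cloud Natural Language provides, one tentative approach to this problem is as follows:

First, I have updated the code from the official samples to fit our needs, to explain this further: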

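# Note: the `enums` module exists only in google-cloud-language 1.x;
# it was removed in 2.0, where the types live on language_v1 directly.
from google.cloud import language_v1
from google.cloud.language_v1 import enums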


def sample_cloud_natural_language_text(text_content):
    """ 
    Args:
      text_content The text content to analyze. Must include at least 20 words.
    """

    client = language_v1.LanguageServiceClient()
    type_ = enums.Document.Type.PLAIN_TEXT

    language = "en"
    document = {"content": text_content, "type": type_, "language": language}


    print("=====CLASSIFY TEXT=====")
    response = client.classify_text(document)
    for category in response.categories:
        print(u"Category name: {}".format(category.name))
        print(u"Confidence: {}".format(category.confidence))


    print("=====ANALYZE TEXT=====")
    response = client.analyze_entities(document)
    for entity in response.entities:
        print(f">>>>> ENTITY {entity.name}")  
        print(u"Entity type: {}".format(enums.Entity.Type(entity.type).name))
        print(u"Salience score: {}".format(entity.salience))

        for metadata_name, metadata_value in entity.metadata.items():
            print(u"{}: {}".format(metadata_name, metadata_value))

        for mention in entity.mentions:
            print(u"Mention text: {}".format(mention.text.content))
            print(u"Mention type: {}".format(enums.EntityMention.Type(mention.type).name))


if __name__ == "__main__":
    #text_content = "That actor on TV makes movies in Hollywood and also stars in a variety of popular new TV shows."
    text_content="Wind chills of zero to -5 degrees are expected in Northwestern Arkansas into North-Central Arkansas extending into portions of northern Oklahoma during the 6-9am window"
    
    sample_cloud_natural_language_text(text_content)

Output

=====CLASSIFY TEXT=====
=====ANALYZE TEXT=====
>>>>> ENTITY Wind chills
Entity type: OTHER
Salience score: 0.46825599670410156
Mention text: Wind chills
Mention type: COMMON
>>>>> ENTITY degrees
Entity type: OTHER
Salience score: 0.16041776537895203
Mention text: degrees
Mention type: COMMON
>>>>> ENTITY Northwestern Arkansas
Entity type: ORGANIZATION
Salience score: 0.07702474296092987
mid: /m/02vvkn4
wikipedia_url: https://en.wikipedia.org/wiki/Northwest_Arkansas
Mention text: Northwestern Arkansas
Mention type: PROPER
>>>>> ENTITY North
Entity type: LOCATION
Salience score: 0.07702474296092987
Mention text: North
Mention type: PROPER
>>>>> ENTITY Arkansas
Entity type: LOCATION
Salience score: 0.07088913768529892
mid: /m/0vbk
wikipedia_url: https://en.wikipedia.org/wiki/Arkansas
Mention text: Arkansas
Mention type: PROPER
>>>>> ENTITY window
Entity type: OTHER
Salience score: 0.06348973512649536
Mention text: window
Mention type: COMMON
>>>>> ENTITY Oklahoma
Entity type: LOCATION
Salience score: 0.04747137427330017
wikipedia_url: https://en.wikipedia.org/wiki/Oklahoma
mid: /m/05mph
Mention text: Oklahoma
Mention type: PROPER
>>>>> ENTITY portions
Entity type: OTHER
Salience score: 0.03542650490999222
Mention text: portions
Mention type: COMMON
>>>>> ENTITY 6
Entity type: NUMBER
Salience score: 0.0
value: 6
Mention text: 6
Mention type: TYPE_UNKNOWN
>>>>> ENTITY 9
Entity type: NUMBER
Salience score: 0.0
value: 9
Mention text: 9
Mention type: TYPE_UNKNOWN
>>>>> ENTITY -5
Entity type: NUMBER
Salience score: 0.0
value: -5
Mention text: -5
Mention type: TYPE_UNKNOWN
>>>>> ENTITY zero
Entity type: NUMBER
Salience score: 0.0
value: 0
Mention text: zero
Mention type: TYPE_UNKNOWN

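As you can see, classify text does not help much (the result is empty). Once we move on to analyze text, we can get some value out of it. We can use that output to build our own categories. The trick (and the hard work) will be creating a pool of keywords that fits each category (the categories we build), which we can then use to sort the data we are analyzing. As for the categories themselves, we can look at the list of currently available categories published by Google to get an idea of what the categories should look like.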
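The keyword-pool idea could be sketched roughly as follows. This is only an illustration, not part of the Natural Language API: `KEYWORD_POOLS`, `categorize_entities`, and `min_score` are hypothetical names, the pools are made up, and the scoring rule (summing the salience of matching entities) is just one possible choice. The input is a list of (entity name, salience) pairs such as those printed above.

```python
from collections import defaultdict

# Hypothetical keyword pools; category names and keywords are illustrative only.
KEYWORD_POOLS = {
    "Weather": {"wind chills", "snow", "degrees", "storm", "rain", "temperature"},
    "Sports": {"game", "team", "score", "season"},
}


def categorize_entities(entities, pools=KEYWORD_POOLS, min_score=0.1):
    """Score custom categories from (entity_name, salience) pairs.

    A category's score is the summed salience of the entities whose
    lower-cased name appears in that category's keyword pool.
    Returns (category, score) pairs with score >= min_score, best first.
    """
    scores = defaultdict(float)
    for name, salience in entities:
        key = name.lower()
        for category, keywords in pools.items():
            if key in keywords:
                scores[category] += salience
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(category, score) for category, score in ranked if score >= min_score]


# A few entities from the output above as (name, salience), rounded.
entities = [
    ("Wind chills", 0.468),
    ("degrees", 0.160),
    ("Northwestern Arkansas", 0.077),
    ("window", 0.063),
]
result = categorize_entities(entities)
print(result)
```

Here `min_score` plays the role of the adjustable bar the question asks about: because the scoring is done on our side, it can be set as low as we like.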

I don't think the current version implements anything to "lower the bar", but it could be requested from Google as a feature.
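Until such a feature exists, one option is to treat the API result as the preferred answer and fall back to a local guess when it comes back empty. The sketch below assumes the categories have already been converted to (name, confidence) pairs; `pick_categories`, `fallback`, and `min_confidence` are hypothetical names, not part of the client library. Note that a client-side cutoff can only discard results the API returned; it cannot recover categories the API withheld.

```python
def pick_categories(api_categories, fallback, min_confidence=0.0):
    """Prefer API categories; otherwise fall back to a local guess.

    api_categories: list of (name, confidence) pairs from classify text.
    fallback: zero-argument callable returning a list of (name, score) pairs,
        for example a keyword-based guess built from analyze text output.
    min_confidence: client-side cutoff applied to the API results only.
    """
    kept = [(name, conf) for name, conf in api_categories if conf >= min_confidence]
    return kept if kept else fallback()


# Example values only: an empty API result falls through to the local guess.
guess = pick_categories([], lambda: [("Weather", 0.63)])
print(guess)
```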

Answered 2022-01-28T17:41:06.450