0

我有一排不同语言的句子,语言代码在单独的列中。我指定只处理某些语言(en、es、fr 或 de),因为我知道 AWS Comprehend 不支持“nl”(荷兰语)。出于某种原因,我继续收到不支持“nl”的错误,即使它没有在我的 when 条件中列出,因此不应该通过 Comprehend udf 发送。关于什么可能是错的任何想法?

这是我的代码:

import pyspark.sql.functions as F

def detect_sentiment(text,language):
    comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
    sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
    return sentiment_analysis


detect_sentiment_udf = F.udf(detect_sentiment)

reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', \
        F.when( (F.col('LANGUAGE')=='en') | (F.col('LANGUAGE')=='es') | (F.col('LANGUAGE')=='fr') | (F.col('LANGUAGE')=='de') , \
               detect_sentiment_udf('SENTENCE', 'LANGUAGE')).otherwise(None) )

reviews_4.show(50)

我收到此错误:

botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the DetectSentiment operation: 
Value 'nl' at 'languageCode'failed to satisfy constraint: Member must satisfy enum value set: [ar, hi, ko, zh-TW, ja, zh, de, pt, en, it, fr, es]
4

1 回答 1

0

仍然不确定为什么我上面的代码不起作用,但我发现以下解决方法是有效的。

def detect_sentiment(text,language):
    comprehend = boto3.client(service_name='comprehend', region_name='us-west-2')
    if (language == 'en') | (language == 'es') | (language == 'fr') | (language == 'de') :
        sentiment_analysis = comprehend.detect_sentiment(Text=text, LanguageCode=language)
        return sentiment_analysis
    else:
        return None

detect_sentiment_udf = F.udf(detect_sentiment)

reviews_4 = reviews_3.withColumn('RAW_SENTIMENT_SCORE', detect_sentiment_udf('SENTENCE','LANGUAGE'))

reviews_4.show(50)
于 2022-01-13T22:03:14.520 回答