tensorflow - Removing stop words in TensorFlow Extended

Question

I need to preprocess NLP data, so I have to remove stop words (from the nltk library) from a TensorFlow dataset. I have tried several things like this:

docs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'], ["It's a trap!"]])
tokenizer = text.WhitespaceTokenizer()
tokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))
data = tokenized_docs.filter(lambda x: x. not in stop_words)

Or this:

tokens = docs.map(lambda x: tokenizer.tokenize(x))
data = tokens.filter(lambda x: tf.strings.strip(x).ref() not in stopwords)

But it didn't work. The first snippet raises an error like this: RaggedTensor is unhashable.

score 3 · Accepted Answer

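As far as I can tell, TensorFlow supports basic string standardization (lowercasing plus punctuation stripping) through a standardization callback. More advanced options, such as removing stop words, do not appear to be supported unless you implement them yourself.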
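A minimal sketch of what "doing it yourself" could look like inside TensorFlow, using only `tf.strings` ops so it can run in `Dataset.map`. The function name `standardize_without_stop_words` and the short `STOP_WORDS` list are hypothetical stand-ins (in practice the list would come from `stopwords.words("english")`); this is an illustration under those assumptions, not a built-in feature.

```python
import re
import tensorflow as tf

# Hypothetical short stop word list, standing in for the NLTK one.
STOP_WORDS = ["me", "the", "a", "it", "s"]

# One alternation that matches any stop word as a whole word,
# together with the whitespace that follows it.
_STOP_PATTERN = r"\b(" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b\s*"


def standardize_without_stop_words(text):
    """Lowercase, drop non-letters, remove stop words, tidy whitespace."""
    text = tf.strings.lower(text)
    text = tf.strings.regex_replace(text, r"[^a-z]", " ")
    text = tf.strings.regex_replace(text, _STOP_PATTERN, "")
    text = tf.strings.regex_replace(text, r"\s+", " ")
    return tf.strings.strip(text)


docs = tf.data.Dataset.from_tensor_slices(
    [["Never tell me the odds."], ["It's a trap!"]]
)
cleaned = docs.map(standardize_without_stop_words)

# Decode the byte strings so the result is easy to read.
result = [[s.decode("utf-8") for s in row] for row in cleaned.as_numpy_iterator()]
print(result)
# [['never tell odds'], ['trap']]
```

Because it works on whole strings rather than on tokenized RaggedTensors, this avoids hashing tensors in a Python `in` test. A function of this shape could presumably also be passed as the `standardize` argument of `tf.keras.layers.TextVectorization`, which accepts a callable.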

It may be easier to do the standardization beforehand, outside TensorFlow, and then pass in the result.

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')


def parse_text(text):
    print(f'Input: {text}')

    text = re.sub("[^a-zA-Z]", ' ', text)
    print(f'Remove punctuation and numbers: {text}')

    text = text.lower().split()
    print(f'Lowercase and split: {text}')

    swords = set(stopwords.words("english"))
    text = [w for w in text if w not in swords]
    print(f'Remove stop words: {text}')

    text = " ".join(text)
    print(f'Final: {text}')

    return text


list1 = [["NEver tell me the odds."],["It's a trap!"]]

for sublist in list1:
    for i in range(len(sublist)):
        sublist[i] = parse_text(sublist[i])

print(list1)
# [['never tell odds'], ['trap']]

score 1 · Accepted Answer

You can use this to remove stop words when using tfx:

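import tensorflow as tf
from nltk.corpus import stopwords
# Remove every English stop word (and the whitespace after it) from the 'review' feature.
outputs['review'] = tf.strings.regex_replace(inputs['review'], r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*', "")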
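One thing to watch with a pattern built by joining words with `|`: the alternatives are tried in list order, so a short word can match before a longer word that starts with it. The sketch below illustrates this with Python's `re` module as a stand-in for `tf.strings.regex_replace`, on the assumption that both engines treat this simple pattern the same way for ASCII text. The `STOP_WORDS` list and both helper names are hypothetical, chosen to mimic a list that contains a word and its contraction.

```python
import re

# Hypothetical list containing both "it" and its contraction "it's".
STOP_WORDS = ["it", "it's", "a", "me", "the"]


def build_pattern(words):
    """Join the words in their given order, as in the one-liner above."""
    return r"\b(" + "|".join(words) + r")\b\s*"


def build_pattern_longest_first(words):
    """Escape each word and try longer alternatives before shorter ones."""
    ordered = sorted(words, key=len, reverse=True)
    return r"\b(" + "|".join(re.escape(w) for w in ordered) + r")\b\s*"


naive = re.sub(build_pattern(STOP_WORDS), "", "it's a trap")
ordered = re.sub(build_pattern_longest_first(STOP_WORDS), "", "it's a trap")

print(repr(naive))    # "'s trap"  ("it" matched first, leaving the "'s")
print(repr(ordered))  # 'trap'
```

Lowercasing the input first (for example with `tf.strings.lower`) is also worth considering, since a stop word list written in lowercase will not match "It's" or "The".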
