python - How do I start a for loop from a selected pandas.df row?

Question

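When I process a pandas.df with a for loop, I usually run into errors. After removing the cause of the error, I have to restart the for loop from the beginning of the DataFrame. How can I start the for loop from the position where the error occurred, instead of running it all over again? For example: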

senti = []
for i in dfs['ssentence']:
    senti.append(get_baidu_senti(i))

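In the code above, I'm trying to run sentiment analysis through an API and store the results in a list. However, the API only accepts GBK-encoded input, while my data is encoded in UTF-8. So it usually hits an error like this: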

UnicodeEncodeError: 'gbk' codec can't encode character '\u30fb' in position 14: illegal multibyte sequence

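So I have to manually remove specific characters such as '\u30fb' and restart the for loop. By that point the list senti already holds a lot of data, so I don't want to throw it away. What can I do to improve the for loop?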

score 1 · Accepted Answer

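If your API requires GBK-encoded data, just encode to that codec using an error handler other than 'strict' (the default).

'ignore' drops any code points that cannot be encoded to GBK: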

dfs['ssentence_encoded'] = dfs['ssentence'].str.encode('gbk', 'ignore')

See the Error Handlers section of the Python codecs documentation.
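To see what the handlers do to the character from the traceback, here is a minimal sketch on a plain str, outside pandas. The sample text is made up for illustration; only the '\u30fb' character comes from the question.

```python
# Made-up sample text containing U+30FB, the character from the traceback.
text = '中文\u30fb测试'

# 'ignore' silently drops whatever GBK cannot represent.
ignored = text.encode('gbk', 'ignore')

# 'replace' substitutes '?' instead, so the loss stays visible in the output.
replaced = text.encode('gbk', 'replace')

print(ignored.decode('gbk'))
print(replaced.decode('gbk'))
```

Which handler fits depends on whether the API result should reflect that something was removed; 'ignore' is the one used in the answer.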

If you need to pass in strings, but only strings that can safely be encoded to GBK, then I'd create a translation map suitable for the str.translate() method:

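class InvalidForEncodingMap(dict):
    """Translation table mapping code points the encoding can't represent to None."""
    def __init__(self, encoding):
        super().__init__()
        self._encoding = encoding
        # code points known to be encodable; these are left untouched
        self._negative = set()
    def __missing__(self, codepoint):
        if codepoint in self._negative:
            # str.translate() keeps a character as-is on LookupError
            raise LookupError(codepoint)
        if chr(codepoint).encode(self._encoding, 'ignore'):
            # can be encoded, record as a negative and raise
            self._negative.add(codepoint)
            raise LookupError(codepoint)
        # cannot be encoded: map to None so str.translate() removes it
        self[codepoint] = None
        return None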

only_gbk = InvalidForEncodingMap('gbk')
dfs['ssentence_gbk_safe'] = dfs['ssentence'].str.translate(only_gbk)

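The InvalidForEncodingMap class lazily creates entries as code points are looked up, so only the code points actually present in your data are processed. If you need to use it more than once, I'd still keep the map instance around for reuse; the cache it has built up can be reused that way.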
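As a rough sketch of how cleaned strings could feed the original loop without aborting it: the example below is an assumption-laden illustration, not the answer's method. It swaps in a simpler, uncached encode/decode round trip instead of the translation map, uses a plain list in place of dfs['ssentence'], and replaces get_baidu_senti() with a hypothetical stand-in called fake_senti(), since the real API's behaviour is not shown in the question.

```python
def to_gbk_safe(text):
    """Drop characters GBK cannot encode by round-tripping through the codec."""
    return text.encode('gbk', 'ignore').decode('gbk')


def fake_senti(text):
    # Hypothetical stand-in for get_baidu_senti(): it only checks that the
    # text is GBK-encodable and returns a dummy score (the text length).
    text.encode('gbk')
    return len(text)


# Made-up sample data standing in for dfs['ssentence'].
sentences = ['很好\u30fb不错', '一般']

senti = []
failed = []  # (position, error) pairs for rows that still raise
for pos, sentence in enumerate(sentences):
    try:
        senti.append(fake_senti(to_gbk_safe(sentence)))
    except UnicodeEncodeError as exc:
        # Keep a placeholder so results stay aligned with the input rows.
        senti.append(None)
        failed.append((pos, exc))
```

Recording failures rather than letting the exception propagate means results already collected are kept, and only the positions listed in failed would need another look.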
