python - How to initialize a "Doc" in textacy 0.6.2?

Question

Trying to follow the simple Doc initialization from the docs in Python 2 doesn't work:

>>> import textacy
>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 120, in __init__
    {compat.unicode_, SpacyDoc}, type(content)))
ValueError: `Doc` must be initialized with set([<type 'unicode'>, <type 'spacy.tokens.doc.Doc'>]) content, not "<type 'str'>"

What should this simple initialization look like for a string or a sequence of strings?

Update:

Passing unicode(content) to textacy.Doc() spits out

ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.

which, imo, would be nice to have from the moment textacy is installed.

Even after installing cld2-cffi, attempting the code above throws

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/util.py", line 120, in load_model
    raise IOError("Can't find model '%s'" % name)
IOError: Can't find model 'en'

score 1 · Accepted Answer

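As shown in the traceback, the issue is in the _init_from_text() function in textacy/doc.py, which tries to detect the language and, on line 136, calls cache.load_spacy() with the string 'en'. (The spacy repo touches on this in this issue comment.)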

I solved this by providing a valid lang (unicode) string, u'en_core_web_sm', and by using unicode for both the content and lang argument strings.

import textacy

content = u'''
    The apparent symmetry between the quark and lepton families of
    the Standard Model (SM) are, at the very least, suggestive of
    a more fundamental relationship between them. In some Beyond the
    Standard Model theories, such interactions are mediated by
    leptoquarks (LQs): hypothetical color-triplet bosons with both
    lepton and baryon number and fractional electric charge.'''

metadata = {
    'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
    'author': 'Burton DeWilde',
    'pub_date': '2012-08-01'}

doc = textacy.Doc(content, metadata=metadata, lang=u'en_core_web_sm')

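That a str instead of a unicode string changes the behavior (with a cryptic error message), that a required package is missing, and that spacy language strings are handled in a possibly outdated, possibly non-comprehensive way all seem like bugs to me.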
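
To avoid the IOError from the question, it can help to check that the model name you pass as lang is actually installed before building the Doc. The following is only a rough sketch: model_is_installed is a hypothetical helper, not part of textacy or spacy, and it assumes the model was installed as a regular Python package (as en_core_web_sm is when installed with pip). A shortcut name such as 'en' is not an importable package, so this check would report it as missing.

```python
import importlib


def model_is_installed(name):
    """Return True if `name` can be imported as a Python package.

    Hypothetical helper: assumes spacy models are installed as pip
    packages, e.g. en_core_web_sm.
    """
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


lang = u'en_core_web_sm'
if not model_is_installed(lang):
    print("Model %r is not installed; install it before passing it as lang." % lang)
```

If the check fails, install the model first and only then call textacy.Doc(content, metadata=metadata, lang=lang).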

score 0 · Accepted Answer

It looks like you're using Python 2 and running into a unicode error. The textacy docs include a note about unicode nuances when using Python 2:

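Note: In almost all cases, textacy (as well as spacy) expects to be working with unicode text data. Throughout the code, this is indicated as str to be consistent with Python 3's default string type; users of Python 2, however, must be mindful to use unicode, and convert from the default (bytes) string type as needed.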

So I'd give this a shot (note the u'''):

content = u'''
          The apparent symmetry between the quark and lepton families of
          the Standard Model (SM) are, at the very least, suggestive of
          a more fundamental relationship between them. In some Beyond the
          Standard Model theories, such interactions are mediated by
          leptoquarks (LQs): hypothetical color-triplet bosons with both
          lepton and baryon number and fractional electric charge.'''

This produced a Doc object as I expected (on Python 3, though).
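
If the text does not come from a literal (for example, it was read from a file as bytes), the u''' prefix is not an option and the conversion has to happen at runtime. Here is a minimal sketch of that conversion; to_text is a hypothetical helper name and UTF-8 is an assumed encoding, so adjust it to your data. Note that on Python 2 a bare unicode(content), as tried in the question, decodes with the ASCII codec by default and fails on non-ASCII bytes.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a unicode string, decoding byte strings if needed.

    Hypothetical helper; the UTF-8 default is an assumption.
    """
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias for `str`, so plain '...'
        # literals take this branch and are decoded to unicode.
        return value.decode(encoding)
    # Already a unicode string (Python 2 `unicode`, Python 3 `str`).
    return value


raw = b"The apparent symmetry between the quark and lepton families"
content = to_text(raw)
```

The resulting content can then be passed to textacy.Doc() in place of the byte string.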
