
The documentation for sense2vec mentions 3 primary files, the first of which is merge_text.py. Because merge_text.py tries to open files compressed by bzip2, I tried several types of input: txt, csv, and bzipped files.

The file is located here: https://github.com/spacy-io/sense2vec/blob/master/bin/merge_text.py

What type of input format does this script require? Also, could someone suggest how to train the model?


2 Answers


I extended and adjusted the code samples from sense2vec.

You start with this input text:

"As far as Saudi Arabia and its motives, that is very simple also. The Saudis are good at money and arithmetic. Faced with the painful choice of losing money maintaining current production at US$60 per barrel or taking two million barrels per day off the market and losing much more money - it's an easy choice: take the path that is less painful. If there are secondary reasons like hurting US tight oil producers or hurting Iran and Russia, that's great, but it's really just about the money."

and turn it into this:

as|ADV far|ADV as|ADP saudi_arabia|ENT and|CCONJ its|ADJ motif|NOUN that|ADJ is|VERB very|ADV simple|ADJ also|ADV saudis|ENT are|VERB good|ADJ at|ADP money|NOUN and|CCONJ arithmetic|NOUN faced|VERB with|ADP painful_choice|NOUN of|ADP losing|VERB money|NOUN maintaining|VERB current_production|NOUN at|ADP us$|SYM 60|MONEY per|ADP barrel|NOUN or|CCONJ taking|VERB two_million|CARDINAL barrel|NOUN per|ADP day|NOUN off|ADP market|NOUN and|CCONJ losing|VERB much_more_money|NOUN it|PRON 's|VERB easy_choice|NOUN take|VERB path|NOUN that|ADJ is|VERB less|ADV painful|ADJ if|ADP there|ADV are|VERB secondary_reason|NOUN like|ADP hurting|VERB us|ENT tight_oil_producer|NOUN or|CCONJ hurting|VERB iran|ENT and|CCONJ russia|ENT 's|VERB great|ADJ but|CCONJ it|PRON 's|VERB really|ADV just|ADV about|ADP money|NOUN

  • Double linebreaks are interpreted as separate documents.
  • URLs are recognised as such, stripped down to domain.tld and marked as |URL
  • Nouns (also when part of noun chunks) are lemmatised (as motives becomes motif)
  • Words with POS tags such as DET (definite article) and PUNCT (punctuation) are dropped

Here is the code. Let me know if you have any questions.

I will probably publish it on github.com/woltob soon.

import spacy
import re

nlp = spacy.load('en')  # spaCy 1.x English model
nlp.matcher = None  # disable the rule-based matcher; entities come from the statistical NER only

LABELS = {
    'ENT': 'ENT',
    'PERSON': 'PERSON',
    'NORP': 'ENT',
    'FAC': 'ENT',
    'ORG': 'ENT',
    'GPE': 'ENT',
    'LOC': 'ENT',
    'LAW': 'ENT',
    'PRODUCT': 'ENT',
    'EVENT': 'ENT',
    'WORK_OF_ART': 'ENT',
    'LANGUAGE': 'ENT',
    'DATE': 'DATE',
    'TIME': 'TIME',
    'PERCENT': 'PERCENT',
    'MONEY': 'MONEY',
    'QUANTITY': 'QUANTITY',
    'ORDINAL': 'ORDINAL',
    'CARDINAL': 'CARDINAL'
}

pre_format_re = re.compile(r'^[\`\*\~]')   # leading markdown formatting characters
post_format_re = re.compile(r'[\`\*\~]$')  # trailing markdown formatting characters
url_re = re.compile(r'(https?:\/\/)?([a-z0-9-]+\.)?([\d\w]+?\.[^\/]{2,63})')  # group 3 captures domain.tld
single_linebreak_re = re.compile(r'\n')
double_linebreak_re = re.compile(r'\n{2,}')
whitespace_re = re.compile(r'[ \t]+')
quote_re = re.compile(r'"|`|´')

# Normalize the raw text: drop formatting and quote characters, collapse
# whitespace, and reduce blank-line-separated paragraphs to one line each.
def strip_meta(text):
    text = text.replace('per cent', 'percent')
    text = text.replace('&gt;', '>').replace('&lt;', '<')
    text = pre_format_re.sub('', text)
    text = post_format_re.sub('', text)
    text = double_linebreak_re.sub('{2break}', text)
    text = single_linebreak_re.sub(' ', text)
    text = text.replace('{2break}', '\n')
    text = whitespace_re.sub(' ', text)
    text = quote_re.sub('', text)
    return text

# Merge named entities and trimmed noun chunks into single tokens, then
# emit one sentence per line as space-separated word|TAG strings.
def transform_doc(doc):
    for ent in doc.ents:
        ent.merge(ent.root.tag_, ent.text, LABELS[ent.label_])
    for np in doc.noun_chunks:
        # Trim leading tokens (e.g. determiners) until the chunk starts
        # with an adverbial or adjectival modifier or a compound.
        while len(np) > 1 and np[0].dep_ not in ('advmod', 'amod', 'compound'):
            np = np[1:]
        np.merge(np.root.tag_, np.text, np.root.ent_type_)
    strings = []
    for sent in doc.sents:
        sentence = []
        if sent.text.strip():
            for w in sent:
                if w.is_space:
                    continue
                w_ = represent_word(w)
                if w_:
                    sentence.append(w_)
            strings.append(' '.join(sentence))
    if strings:
        return '\n'.join(strings) + '\n'
    else:
        return ''


# Render one token as "text|TAG": URLs are stripped to domain.tld and tagged
# |URL, entities keep their mapped label, and PUNCT and DET tokens are dropped.
def represent_word(word):
    if word.like_url:
        x = url_re.search(word.text.strip().lower())
        if x:
            return x.group(3)+'|URL'
        else:
            return word.text.lower().strip()+'|URL?'
    text = re.sub(r'\s', '_', word.text.strip().lower())
    tag = LABELS.get(word.ent_type_)
    # Dropping PUNCTUATION such as commas and DET like the
    if tag is None and word.pos_ not in ['PUNCT', 'DET']:
        tag = word.pos_
    elif tag is None:
        return None
    # if not word.pos_:
    #    tag = '?'
    return text + '|' + tag

corpus = '''
As far as Saudi Arabia and its motives, that is very simple also. The Saudis are
good at money and arithmetic. Faced with the painful choice of losing money
maintaining current production at US$60 per barrel or taking two million barrels
per day off the market and losing much more money - it's an easy choice: take
the path that is less painful. If there are secondary reasons like hurting US
tight oil producers or hurting Iran and Russia, that's great, but it's really
just about the money.
'''

corpus_stripped = strip_meta(corpus)

doc = nlp(corpus_stripped)
corpus_ = []
for word in doc:
    # Only lemmatize NOUN and PROPN tokens longer than 3 characters whose
    # lemma differs in length from the surface form.
    if word.pos_ in ['NOUN', 'PROPN'] and len(word.text) > 3 and len(word.text) != len(word.lemma_):
        # Keep the first character of the original word (preserves casing),
        # take the rest from the lemma, and re-append any trailing whitespace.
        lemma_ = str(word.text[:1] + word.lemma_[1:] + word.text_with_ws[len(word.text):])
        corpus_.append(lemma_)
    else:
        # All other words are added unchanged.
        corpus_.append(word.text_with_ws)

result = transform_doc(nlp(''.join(corpus_)))

sense2vec_filename = 'text.txt'
with open(sense2vec_filename, 'w') as file:
    file.write(result)
print(result)

You can visualise your model with Gensim in TensorBoard using: https://github.com/ArdalanM/gensim2tensorboard
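For the training step itself (not covered in the original answer), here is a minimal sketch using Gensim's word2vec on the text.txt produced above; the hyperparameters and the Gensim 4.x API (vector_size) are assumptions, so adjust them for your setup and a real corpus:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Each line of text.txt is one sentence of word|TAG tokens, so LineSentence
# can stream it directly as a training corpus.
sentences = LineSentence('text.txt')

# Illustrative hyperparameters only; min_count=1 is needed for a tiny corpus.
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1, workers=4)
model.save('sense2vec_w2v.model')

# Queries use the same word|TAG format as the preprocessed text.
print(model.wv.most_similar('money|NOUN'))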

I would also adapt this code to follow the sense2vec approach (e.g. words become lowercase in the preprocessing step; simply comment that out in the code).

Happy coding, woltob

Answered 2017-03-29T15:20:56.180

The input file should be a bzipped JSON. To use a plain text file, just edit merge_text.py as follows:

def iter_comments(loc):
    with bz2.BZ2File(loc) as file_:
        for i, line in enumerate(file_):
            yield line.decode('utf-8', errors='ignore')
            # yield ujson.loads(line)['body']
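For reference, the originally expected input is one JSON object per line with a body field, compressed with bzip2. A minimal sketch of producing such a file (the file name and contents here are illustrative):

import bz2
import json

comments = [{'body': 'First document text.'},
            {'body': 'Second document text.'}]

# One JSON object per line, bzip2-compressed, matching the commented-out
# ujson.loads(line)['body'] in merge_text.py.
with bz2.open('comments.bz2', 'wt', encoding='utf-8') as f:
    for comment in comments:
        f.write(json.dumps(comment) + '\n')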
Answered 2016-08-09T08:10:34.023