
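I am trying to do named entity recognition, i.e. to extract people, places, and so on, from pinyin (the romanization of Chinese characters).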

For example (from Wikipedia):

 "Jiang Zemin, Li Peng and Zhu Rongji led the nation in the 1990s. Under their administration, China's economic performance pulled an estimated 150 million peasants out of poverty and sustained an average annual gross domestic product growth rate of 11.2%.[125][better source needed][126][better source needed] The country joined the World Trade Organization in 2001, and maintained its high rate of economic growth under Hu Jintao and Wen Jiabao's leadership in the 2000s. However, the growth also severely impacted the country's resources and environment,[127][128] and caused major social displacement.[129][130]
Chinese Communist Party general secretary Xi Jinping has ruled since 2012 and has pursued large-scale efforts to reform China's economy [131][132] (which has suffered from structural instabilities and slowing growth),[133][134][135] and has also reformed the one-child policy and prison system,[136] as well as instituting a vast anti corruption crackdown.[137] In 2013, China initiated the Belt and Road Initiative, a global infrastructure investment project.[138] The COVID-19 pandemic broke out in Wuhan, Hubei in 2019.[139][140]"

From the text above I would like to extract entities such as:

Jiang Zemin
Li Peng
Zhu Rongji
Hu Jintao
Wuhan
Hubei
etc...

NER on Chinese characters is complicated, and I don't know of any method for extracting pinyin.

My current plan is to try all permutations of the 1,300+ Chinese characters, like this:

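import pandas as pd
import numpy as np

# Load the space-separated pinyin/character table
data = pd.read_csv('chinese_tones.txt', sep=" ", header=None)
data.columns = ["pinyin", "character"]

# Strip the tone digits (the data doesn't have tones, which makes this harder).
# regex=True is required: recent pandas treats the pattern as a literal by default.
data['pinyin'] = data['pinyin'].str.replace(r'\d+', '', regex=True)
s = data['pinyin'].drop_duplicates().to_numpy()

# Every two-syllable concatenation of the unique syllables
combos = pd.Series(np.add.outer(s, s).ravel())

# Combine single syllables and pairs into one giant list
all_pinyin = pd.concat([pd.Series(s), combos], ignore_index=True)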

Then I plan to use .isin() to compare the text data against the pinyin list.
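The membership check can also be done without materializing every combination. Below is a minimal sketch, not a tested solution: it asks whether each capitalized token can be split entirely into known pinyin syllables, then groups adjacent matches. `SYLLABLES` is a tiny hand-picked subset for illustration (in practice it would presumably be the deduplicated pinyin column), and `is_pinyin`, `pinyin_candidates`, and the three-syllable cap are hypothetical names and choices, not part of the original plan.

```python
import re

# Tiny illustrative subset of toneless pinyin syllables (assumption: in
# practice this would be the deduplicated "pinyin" column from the data file).
SYLLABLES = {"jiang", "ze", "min", "li", "peng", "zhu", "rong", "ji",
             "hu", "jin", "tao", "wen", "jia", "bao", "xi", "ping",
             "wu", "han", "bei"}

def is_pinyin(token, syllables=SYLLABLES, max_syllables=3):
    """Return True if token splits entirely into at most max_syllables syllables."""
    word = token.lower()
    # reachable[i] = fewest syllables needed to cover word[:i], or None
    reachable = [None] * (len(word) + 1)
    reachable[0] = 0
    for end in range(1, len(word) + 1):
        # The longest pinyin syllables (e.g. "zhuang") have 6 letters
        for start in range(max(0, end - 6), end):
            if reachable[start] is not None and word[start:end] in syllables:
                count = reachable[start] + 1
                if reachable[end] is None or count < reachable[end]:
                    reachable[end] = count
    return reachable[-1] is not None and reachable[-1] <= max_syllables

def pinyin_candidates(text):
    """Group consecutive capitalized tokens that all look like pinyin."""
    candidates, current = [], []
    # Words become one token each; punctuation and digits split the groups
    for token in re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text):
        if token[0].isupper() and is_pinyin(token):
            current.append(token)
        else:
            if current:
                candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates

print(pinyin_candidates("Jiang Zemin, Li Peng and Zhu Rongji led the nation."))
```

With a full syllable table, ordinary capitalized English words that happen to decompose into syllables (for instance "China" or "Long") would also match, so this only produces candidates that still need filtering.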

Does anyone know a better way to extract pinyin entities?


1 Answer


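You can train a character-level sequence tagger (e.g. a BiLSTM) to extract Chinese names from the sequence. You will also need to construct some hard cases for the model (e.g. words that look similar to names). You can easily find plenty of Chinese names from here, then use a Chinese-character-to-pinyin tool (e.g. python-pinyin) to convert the names to their pinyin form.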
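As a rough illustration of what the training data for such a tagger might look like, the sketch below labels every character of a sentence with a BIO tag and mixes in a hard negative. Everything here is an assumption for illustration: the name list (which would normally come from converting Chinese names with a pinyin tool), the sentence templates, the `B-PER`/`I-PER`/`O` tag scheme, and the helper names `label_chars` and `build_examples`.

```python
import random

# Hypothetical inputs: romanized names (e.g. produced by a hanzi-to-pinyin
# tool) and sentence templates with a {name} slot.
NAMES = ["Li Peng", "Zhu Rongji", "Hu Jintao"]
TEMPLATES = ["{name} led the nation in the 1990s.",
             "The policy was announced by {name} last year."]
# Hard negatives: sentences full of name-like words that are not names.
HARD_NEGATIVES = ["She sang a long song about the main gang."]

def label_chars(sentence, name=None):
    """Return (char, tag) pairs using character-level BIO tags."""
    tags = ["O"] * len(sentence)
    if name:
        start = sentence.find(name)
        if start != -1:
            tags[start] = "B-PER"
            for i in range(start + 1, start + len(name)):
                tags[i] = "I-PER"
    return list(zip(sentence, tags))

def build_examples(n, seed=0):
    """Build n templated positive examples plus all hard negatives."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        name = rng.choice(NAMES)
        sentence = rng.choice(TEMPLATES).format(name=name)
        examples.append(label_chars(sentence, name))
    examples.extend(label_chars(s) for s in HARD_NEGATIVES)
    return examples

for char, tag in label_chars("Li Peng led.", "Li Peng"):
    print(repr(char), tag)
```

The resulting character/tag sequences are the kind of input a BiLSTM tagger would be trained on; the model itself is left out here.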

Answered 2021-01-15T08:46:13.093