I am extracting data from the EDICT dictionary file used in the WWWJDIC examples:

相同器官 [そうどうきかん] /(n) homologous organ/
相同染色体 [そうどうせんしょくたい] /(n) homologous chromosome/
相同組換え [そうどうくみかえ] /(n) homologous recombination/
相同的組み換え [そうどうてきくみかえ] /(n) homologous recombination/
相同的組換 [そうどうてきくみかえ] /(n) homologous recombination/
相同的組換え [そうどうてきくみかえ] /(n) homologous recombination/
相入れない [あいいれない] /(iK) (exp,adj-i) in conflict/incompatible/out of harmony/running counter/mutually exclusive/clashing with/
相年 [あいどし] /(n,adj-no) the same age/
相伴 [しょうばん] /(n,vs) partaking/participating/taking part in/sharing (something with someone)/
相伴う [あいともなう] /(v5u) to accompany/
相判 [あいはん] /(n,vs) (1) official seal/verification seal/affixing a seal to an official document/(2) making a joint signature or seal/
相判 [あいばん] /(n) (1) medium-sized paper (approx. 15x21 cm, used for notebooks)/(2) medium-sized photo print (approx. 10x13 cm)/
相判 [あいばん] /(n,vs) (1) official

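Each line specifies the entry's part of speech, e.g. /(n) for a noun and /(adj) for an adjective. I want to get every entry tagged with at least one of the parts of speech in this array: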

["n", "n-adv", "n-pref", "n-suf", "n-t", "num", "pn", "adj-no", "adj-f", "adv-n", "vs"] 

I am trying to split up the lines like this:

file = File.open("EDICT.txt")
file.each_line do |line|
   if line[#Regex]
.
.

I am using a regular expression, but the furthest I have got is:

/\/[(](n|n-adv|n-pref|n-suf|n-t|num|pn|adj-no|adj-f|adv-n|vs|n,vs)[)]/

This is not robust. In addition, there are sometimes tags like this:

/(adj-no,n-adv,n-t)

which the regex does not match. At the same time, it should not match these tags:

["adj-i", "adj-na", "adj-pn", "adj-t", "adj", "adv", "adv-to", "aux", "aux-v", "aux-adj", "conj",
"ctr", "exp", "int", "iv", "pref", "prt", "suf", "v1", "v2a-s", "v4h", "v4r", "v5", "v5argu", 
"v5b", "v5g", "v5k", "v5k-s", "v5m", "v5n", "v5r", "v5r-i", "v5s", "v5t", "v5u", "v5u-s", "v5uru",
"v5z", "vz", "vi", "vk", "vn", "vs-c", "vs-i", "vs-s", "vt"] 

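To make the gap concrete, here is a small check of the regex above against two tag groups. The second sample line is a made-up entry, used only to carry the combined tag. The alternation matches only when the whole parenthesized group is exactly one of the listed alternatives, so any combination that is not spelled out is missed.

```ruby
# The alternation regex from above, unchanged.
STRICT = /\/[(](n|n-adv|n-pref|n-suf|n-t|num|pn|adj-no|adj-f|adv-n|vs|n,vs)[)]/

single   = "相同器官 [そうどうきかん] /(n) homologous organ/"
combined = "ある単語 [あるたんご] /(adj-no,n-adv,n-t) .../" # hypothetical entry

single_hit   = !!(single =~ STRICT)   # the group is exactly "n"
combined_hit = !!(combined =~ STRICT) # "adj-no,n-adv,n-t" is not a listed alternative

puts single_hit   # true
puts combined_hit # false
```
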
What is a better, more robust way to check whether a line contains one of the desired /() tags?

1 Answer

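class String
  # Part-of-speech tags that count as a wanted entry.
  Nouns = %w[n n-adv n-pref n-suf n-t num pn adj-no adj-f adv-n vs]
  # True if the first /(...) group contains at least one tag from Nouns.
  # to_s covers lines with no /(...) group, where self[...] returns nil.
  def noun_entry?; self[%r{/\(([^)]+)\)}, 1].to_s.split(/,\s*/).&(Nouns).any? end
end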

"相同器官 [そうどうきかん] /(n) homologous organ/".noun_entry?
# => true
"相判 [あいばん] /(n,vs) (1) official".noun_entry?
# => true
"ある単語 [あるたんご] /(adj-no,n-adv,n-t) .../".noun_entry?
# => true
"別の単語 [べつのたんご] /(ctr,exp,int) .../".noun_entry?
# => false
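  • [^)] is any single character other than )
  • [^)]+ is a non-empty sequence that does not include )
  • ([^)]+) captures such a sequence.
  • %r{/\(([^)]+)\)} is a regex for such a sequence surrounded by /( and )
  • [regex, 1] extracts the first capture of the match, i.e. whatever matched [^)]+
  • split(/,\s*/) splits that sequence into an array at each comma (optionally followed by whitespace).
  • &(Nouns) takes the intersection of that array with the array Nouns
  • any? checks whether the intersection contains anything.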
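
The same steps can also be written out one at a time without reopening String. This is only a sketch: noun_line? and NOUN_TAGS are names invented here, and the last sample line is made up to show the case where no /(...) group is present.

```ruby
# Same tag list as Nouns above, under a hypothetical name.
NOUN_TAGS = %w[n n-adv n-pref n-suf n-t num pn adj-no adj-f adv-n vs].freeze

# Hypothetical helper: returns false instead of raising when the line
# has no /(...) group at all.
def noun_line?(line)
  group = line[%r{/\(([^)]+)\)}, 1] # e.g. "adj-no,n-adv,n-t", or nil
  return false if group.nil?

  tags = group.split(/,\s*/)        # e.g. ["adj-no", "n-adv", "n-t"]
  (tags & NOUN_TAGS).any?           # is the intersection non-empty?
end

lines = [
  "相同器官 [そうどうきかん] /(n) homologous organ/",
  "相伴う [あいともなう] /(v5u) to accompany/",
  "相年 [あいどし] /(n,adj-no) the same age/",
  "a line without any tag group" # made-up line
]

# Keep only the wanted entries: here the first and third lines.
nouns = lines.select { |l| noun_line?(l) }
puts nouns
```

On the real file the same predicate could be applied to each line, for example inside File.foreach("EDICT.txt"). Note that only the first parenthesized group after the slash is inspected; in the 相入れない entry above, that group is iK rather than the part-of-speech list that follows it.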
answered 2013-09-03T12:55:19.240