I need to find all the English words that can be formed from the letters in a string:
sentence="Ziegler's Giant Bar"
I can get an array of the letters with
sentence.split(//)
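How can I generate more than 4500 English words from this sentence in Ruby?
[Edit]
It may be best to split the problem into parts:
- build an array of only the words that have 10 letters or fewer
- look up the longer words separately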
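[Assuming you can reuse the source letters within one word]: For each word in your dictionary list, construct two arrays of letters, one for the candidate word and one for the input string. Subtract the input array of letters from the word's array of letters; if no letters are left over, you have a match. Code to do that looks like this: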
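def findWordsWithReplacement(sentence)
  out=[]
  splitArray=sentence.downcase.split(//)
  # String#each was removed in Ruby 1.9, so read the dictionary line by line
  File.foreach("/usr/share/dict/words") {|line|
    word=line.strip   # strip! returns nil when nothing is stripped, so use strip
    if (word.downcase.split(//) - splitArray).empty?
      out.push word
    end
  }
  return out
end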
You can call that function from the irb debugger like so:
output=findWordsWithReplacement("some input string"); puts output.join(" ")
...or here is a wrapper you could use to call the function interactively from a script:
puts "enter the text."
ARGF.each {|line|
  puts "working..."
  out=findWordsWithReplacement(line)
  puts out.join(" ")
  puts "there were #{out.size} words."
}
When run on a Mac, the output looks like this:
$ ./findwords.rb
enter the text.
Ziegler's Giant Bar
working...
A a aa aal aalii Aani Ab aba abaiser abalienate Abantes Abaris abas abase abaser Abasgi abasia Abassin abatable abater abatis abaze abb Abba abbas abbasi abbassi abbatial abbess Abbie Abe abear Abel abele Abelia Abelian Abelite abelite abeltree Aberia aberrant aberrate abet abettal Abie Abies abietate abietene abietin Abietineae Abiezer Abigail abigail abigeat abilla abintestate
[....]
Z za Zabaean zabeta Zabian zabra zabti zabtie zag zain Zan zanella zant zante Zanzalian zanze Zanzibari zar zaratite zareba zat zati zattare Zea zeal zealless zeallessness zebra zebras Zebrina zebrine zee zein zeist zel zig zigzag zigzagger Zilla zing zingel Zingiber zingiberene Zinnia zinsang Zinzar zira zirai Zirbanit Zirian Zirianian Zizania Zizia zizz
there were 6725 words.
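That is well over 4500 words, but only because the Mac word dictionary is pretty big. If you want to reproduce Knuth's results exactly, download and unzip Knuth's dictionary from here: http://www.packetstormsecurity.org/Crackers/wordlists/dictionaries/knuth_words.gz and replace "/usr/share/dict/words" with the path to wherever you unpacked the substitute dictionary. If you did it right you'll get 4514 words, ending in this collection: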
zanier zanies zaniness Zanzibar zazen zeal zebra zebras Zeiss zeitgeist Zen Zennist zest zestier zeta Ziegler zig zigging zigzag zigzagging zigzags zing zingier zings zinnia
I believe that answers the original question.
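To see the array-subtraction check in isolation, here is a minimal sketch that runs it against a small in-memory list rather than a dictionary file. The helper name `spellable_with_reuse?` and the sample words are assumptions made for illustration; they are not part of the answer above.

```ruby
# Hypothetical helper: true when every letter of `word` occurs somewhere in
# `sentence` (letters may be reused any number of times).
def spellable_with_reuse?(word, sentence)
  available = sentence.downcase.chars
  # Array#- drops every element that also appears in `available`,
  # so an empty result means no letter of the word was missing.
  (word.downcase.chars - available).empty?
end

sample_words = %w[zebra Zen giant quiz bagel]   # stand-in for a dictionary file
matches = sample_words.select { |w| spellable_with_reuse?(w, "Ziegler's Giant Bar") }
puts matches.inspect
```

Because `Array#-` ignores how many times a letter occurs, this check only tests membership, which is exactly what "reuse allowed" means here.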
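Alternatively, the questioner/reader might have wanted to list all the words that can be constructed from the string without reusing any of the input letters. My suggested code for that works as follows: copy the candidate word, then for each letter in the input string, destructively remove the first instance of that letter from the copy (using "slice!"). If this process absorbs all the letters, accept the word.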
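def findWordsNoReplacement(sentence)
  out=[]
  splitInput=sentence.downcase.split(//)
  # String#each was removed in Ruby 1.9, so read the dictionary line by line
  File.foreach("/usr/share/dict/words") {|line|
    word=line.strip   # strip! returns nil when nothing is stripped, so use strip
    copy=word.downcase
    # remove one occurrence of each input letter from the copy
    splitInput.each {|o| copy.slice!(o) }
    out.push word if copy.empty?
  }
  return out
end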
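The same no-reuse test can also be expressed by comparing letter counts instead of destructively slicing a copy. This is a different technique from the `slice!` approach above, sketched here only for comparison; the helper names `letter_counts` and `spellable_without_reuse?` are hypothetical.

```ruby
# Hypothetical helper: counts how often each character occurs in a string.
def letter_counts(text)
  counts = Hash.new(0)
  text.downcase.each_char { |c| counts[c] += 1 }
  counts
end

# True when `word` needs no letter more often than `sentence` supplies it.
def spellable_without_reuse?(word, sentence)
  supply = letter_counts(sentence)
  # `supply` defaults to 0, so a letter absent from the sentence fails the test.
  letter_counts(word).all? { |letter, needed| supply[letter] >= needed }
end

phrase = "Ziegler's Giant Bar"
puts spellable_without_reuse?("zebra", phrase)   # uses z, e, b, r, a once each
puts spellable_without_reuse?("zigzag", phrase)  # would need two z's
```

Counting once per word avoids mutating any string, at the cost of building a small hash for every candidate.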
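If you want to find words whose letters, and the frequency of each letter, are restricted by the given phrase, you can construct a regex to do this for you: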
sentence = "Ziegler's Giant Bar"
# count how many times each letter occurs in the
# sentence (ignoring case, and removing non-letters)
counts = Hash.new(0)
sentence.downcase.gsub(/[^a-z]/,'').split(//).each do |letter|
  counts[letter] += 1
end
letters = counts.keys.join
length = counts.values.inject { |a,b| a + b }
# construct a regex that matches up to that many occurrences
# of only those letters, ignoring non-letters
# (in a positive lookahead)
length_regex = /(?=^(?:[^a-z]*[#{letters}]){1,#{length}}[^a-z]*$)/i
# construct regexes that match each letter up to its
# proper frequency (in a positive lookahead)
count_regexes = counts.map do |letter, count|
  /(?=^(?:[^#{letter}]*#{letter}){0,#{count}}[^#{letter}]*$)/i
end
# combine the regexes, to form a regex that will only
# match words that are made of a subset of the letters in the string
regex = /#{length_regex}#{count_regexes.join('')}/
# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end
words.length #=> 3182
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "Abantes",
"Abaris", "abas", "abase", "abaser", "Abasgi", "abate", "abater", "abatis",
...
"ba", "baa", "Baal", "baal", "Baalist", "Baalite", "Baalize", "baar", "bae",
"Baeria", "baetzner", "bag", "baga", "bagani", "bagatine", "bagel", "bagganet",
...
"eager", "eagle", "eaglet", "eagre", "ean", "ear", "earing", "earl", "earlet",
"earn", "earner", "earnest", "earring", "eartab", "ease", "easel", "easer",
...
"gab", "Gabe", "gabi", "gable", "gablet", "Gabriel", "Gael", "gaen", "gaet",
"gag", "gagate", "gage", "gageable", "gagee", "gageite", "gager", "Gaia",
...
"Iberian", "Iberis", "iberite", "ibis", "Ibsenite", "ie", "Ierne", "Igara",
"Igbira", "ignatia", "ignite", "igniter", "Ila", "ilesite", "ilia", "Ilian",
...
"laang", "lab", "Laban", "labia", "labiate", "labis", "labra", "labret", "laet",
"laeti", "lag", "lagan", "lagen", "lagena", "lager", "laggar", "laggen",
...
"Nabal", "Nabalite", "nabla", "nable", "nabs", "nae", "naegate", "naegates",
"nael", "nag", "Naga", "naga", "Nagari", "nagger", "naggle", "nagster", "Naias",
...
"Rab", "rab", "rabat", "rabatine", "Rabi", "rabies", "rabinet", "rag", "raga",
"rage", "rager", "raggee", "ragger", "raggil", "raggle", "raging", "raglan",
...
"sa", "saa", "Saan", "sab", "Saba", "Sabal", "Saban", "sabe", "saber",
"saberleg", "Sabia", "Sabian", "Sabina", "sabina", "Sabine", "sabine", "Sabir",
...
"tabes", "Tabira", "tabla", "table", "tabler", "tables", "tabling", "Tabriz",
"tae", "tael", "taen", "taenia", "taenial", "tag", "Tagabilis", "Tagal",
...
"zest", "zeta", "ziara", "ziarat", "zibeline", "zibet", "ziega", "zieger",
"zig", "zing", "zingel", "Zingiber", "zira", "zirai", "Zirbanit", "Zirian"]
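Positive lookaheads let you create a regex that matches a position in the string where some specified pattern matches, without consuming the part of the string that matches. We use them here to match the same string against multiple patterns in a single regex. The position only matches if all of our patterns match.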
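A minimal sketch of that chaining idea, using a two-letter alphabet so the behaviour is easy to follow. The variable names and the sample candidates are made up for illustration.

```ruby
# Each lookahead inspects the whole string from its start and consumes nothing,
# so several independent conditions can be placed one after another.
only_a_and_b  = /(?=^[ab]+$)/                  # nothing but "a" and "b"
at_most_one_a = /(?=^(?:[^a]*a){0,1}[^a]*$)/   # no more than one "a"
at_most_two_b = /(?=^(?:[^b]*b){0,2}[^b]*$)/   # no more than two "b"s

# Interpolating a Regexp embeds its source (and flags) in the new pattern.
combined = /#{only_a_and_b}#{at_most_one_a}#{at_most_two_b}/

%w[ab bab abba bbb cab].each do |candidate|
  puts "#{candidate}: #{combined.match?(candidate)}"
end
```

The combined pattern matches an empty string at the start of the candidate, and only when all three lookaheads succeed there.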
If we allow unlimited reuse of the letters from the original phrase (as Knuth did, according to glenra's comment), then constructing the regex is even easier:
sentence = "Ziegler's Giant Bar"
# find all the letters in the sentence
letters = sentence.downcase.gsub(/[^a-z]/,'').split(//).uniq
# construct a regex that matches any line in which
# the only letters used are the ones in the sentence
regex = /^([^a-z]|[#{letters.join}])*$/i
# open a big file of words, and find all the ones that match
words = File.open("/usr/share/dict/words") do |f|
  f.map { |word| word.chomp }.find_all { |word| regex =~ word }
end
words.length #=> 6725
words #=> ["A", "a", "aa", "aal", "aalii", "Aani", "Ab", "aba", "abaiser", "abalienate",
...
"azine", "B", "b", "ba", "baa", "Baal", "baal", "Baalist", "Baalite",
"Baalize", "baar", "Bab", "baba", "babai", "Babbie", "Babbitt", "babbitt",
...
"Britannian", "britten", "brittle", "brittleness", "brittling", "Briza",
"brizz", "E", "e", "ea", "eager", "eagerness", "eagle", "eagless", "eaglet",
"eagre", "ean", "ear", "earing", "earl", "earless", "earlet", "earliness",
...
"eternalize", "eternalness", "eternize", "etesian", "etna", "Etnean", "Etta",
"Ettarre", "ettle", "ezba", "Ezra", "G", "g", "Ga", "ga", "gab", "gabber",
"gabble", "gabbler", "Gabe", "gabelle", "gabeller", "gabgab", "gabi", "gable",
...
"grittiness", "grittle", "Grizel", "Grizzel", "grizzle", "grizzler", "grr",
"I", "i", "iba", "Iban", "Ibanag", "Iberes", "Iberi", "Iberia", "Iberian",
...
"itinerarian", "itinerate", "its", "Itza", "Izar", "izar", "izle", "iztle",
"L", "l", "la", "laager", "laang", "lab", "Laban", "labara", "labba", "labber",
...
"litter", "litterer", "little", "littleness", "littling", "littress", "litz",
"Liz", "Lizzie", "Llanberisslate", "N", "n", "na", "naa", "Naassenes", "nab",
"Nabal", "Nabalite", "Nabataean", "Nabatean", "nabber", "nabla", "nable",
...
"niter", "nitraniline", "nitrate", "nitratine", "Nitrian", "nitrile",
"nitrite", "nitter", "R", "r", "ra", "Rab", "rab", "rabanna", "rabat",
"rabatine", "rabatte", "rabbanist", "rabbanite", "rabbet", "rabbeting",
...
"riteless", "ritelessness", "ritling", "rittingerite", "rizzar", "rizzle", "S",
"s", "sa", "saa", "Saan", "sab", "Saba", "Sabaean", "sabaigrass", "Sabaist",
...
"strigine", "string", "stringene", "stringent", "stringentness", "stringer",
"stringiness", "stringing", "stringless", "strit", "T", "t", "ta", "taa",
"Taal", "taar", "Tab", "tab", "tabaret", "tabbarea", "tabber", "tabbinet",
...
"tsessebe", "tsetse", "tsia", "tsine", "tst", "tzaritza", "Tzental", "Z", "z",
"za", "Zabaean", "zabeta", "Zabian", "zabra", "zabti", "zabtie", "zag", "zain",
...
"Zirian", "Zirianian", "Zizania", "Zizia", "zizz"]
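I don't think Ruby ships with an English dictionary. But you could try storing all the permutations of the original string in an array and then checking those strings against Google? Say a string counts as a real word if it gets more than 100,000 hits or so?
You can get an array of the letters like this: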
sentence = "Ziegler's Giant Bar"
letters = sentence.split(//)
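A rough sketch of the permutation idea on a much shorter input. It swaps the web search for a tiny hard-coded `known_words` set, which is purely an assumption for illustration, and it also prints how many full-length orderings the original phrase would produce, since that number decides whether the approach is practical.

```ruby
require "set"

# Stand-in for a real word source; no web lookup is attempted here.
known_words = Set.new(%w[bar bra arb ab ba])

letters = "Bar".downcase.chars

# Try every ordering of every non-empty selection of the letters.
candidates = (1..letters.size).flat_map { |n|
  letters.permutation(n).map(&:join)
}.uniq

found = candidates.select { |c| known_words.include?(c) }.sort
puts found.inspect

# Full-length orderings grow factorially with the number of letters.
letter_total = "Ziegler's Giant Bar".downcase.scan(/[a-z]/).size
orderings = (1..letter_total).inject(:*)
puts "#{letter_total} letters give #{orderings} full-length orderings"
```

For the full phrase, filtering a word list (as in the other answers) is far cheaper than enumerating orderings and testing each one.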