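I'm preparing some table names for an ORM, and I want to convert plural table names to singular entity names. My only problem is finding a reliable algorithm. Here is what I'm doing right now:
- If a word ends with -ies, I replace the ending with -y
- If a word ends with -es, I remove that ending. This doesn't always work, however: for example, it turns Types into Typ
- Otherwise, I just remove the trailing -s
Does anyone know of a better algorithm?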
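For reference, the three rules listed in the question can be sketched roughly as follows. This is only an illustration in Python (the question does not name a language), and the function name `naive_singularize` is made up for the example.

```python
def naive_singularize(word: str) -> str:
    """Apply the three suffix rules from the question, in order."""
    lower = word.lower()
    if lower.endswith("ies"):
        return word[:-3] + "y"   # Categories -> Category
    if lower.endswith("es"):
        return word[:-2]         # Boxes -> Box, but also Types -> Typ
    if lower.endswith("s"):
        return word[:-1]         # Customers -> Customer
    return word


for name in ["Categories", "Boxes", "Customers", "Types"]:
    print(name, "->", naive_singularize(name))
```

The last line of output shows the failure mentioned above: Types comes back as Typ.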
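Those are all general rules (and good ones), but English is not a language for the faint of heart :-).
My own preference would be to have a transformation engine along with a set of transformations (surprisingly enough) to do the actual work. You would run through the transformations (from specific to general) and, when a match is found, apply the transformation to the word and stop.
Regular expressions would be an ideal approach to this because of their expressiveness. An example rule set: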
1. If the word is fish, return fish.
2. If the word is sheep, return sheep.
3. If the word is "radii", return "radius".
4. If the word ends in "ii", replace that "ii" with "us" (octopii,virii).
5. If a word ends with -ies, replace the ending with -y
6. If a word ends with -es, remove it.
7. Otherwise, just remove any trailing -s.
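Note the requirement to keep this set of transformations up to date. For example, suppose someone adds the table name types. This would currently be caught by rule #6, and you would get the singular value typ, which is plainly wrong.
The solution is to insert a new rule somewhere before #6, such as: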
3.5: If the word is "types", return "type".
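for a very specific transformation, or perhaps somewhere later in the list if it can be made more general.
In other words, you basically need to keep this transformation table updated as you discover all those wonderful exceptions that English has spawned over the centuries.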
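A minimal sketch of such an engine, assuming Python and its standard `re` module, might look like the following. The rule list mirrors the example rule set above plus the "3.5" rule for types; the names `RULES` and `singularize` are illustrative, and the sketch assumes lowercase input.

```python
import re

# Ordered from most specific to most general; the first matching rule wins.
RULES = [
    (re.compile(r"^fish$"), "fish"),
    (re.compile(r"^sheep$"), "sheep"),
    (re.compile(r"^radii$"), "radius"),
    (re.compile(r"^types$"), "type"),   # the "3.5" rule, placed before -es
    (re.compile(r"ii$"), "us"),
    (re.compile(r"ies$"), "y"),
    (re.compile(r"es$"), ""),
    (re.compile(r"s$"), ""),
]


def singularize(word: str) -> str:
    """Apply the first rule whose pattern matches, then stop."""
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word, count=1)
    return word  # no rule matched; return the word unchanged


for table in ["types", "radii", "communities", "boxes", "customers", "sheep"]:
    print(table, "->", singularize(table))
```

Adding a newly discovered exception then means inserting one more tuple at the right position in the list, which is the maintenance cost described above.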
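The other possibility is to not waste your time with general rules at all.
Since the use case for this requirement is currently only to singularize table names, and that set of table names will be relatively small (at least compared to the set of plural English words), just create another table (or some sort of data structure) called singulars, which maps all the current plural table names (employees, customers) to singular object names (employee, customer).
Then, every time a table is added to your schema, make sure you add an entry to the singulars "table" so that you can singularize it.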
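As a rough illustration, the singulars structure could be as simple as a dictionary. The sketch below assumes Python; `entity_name` is a hypothetical helper, and failing loudly on an unknown table is one possible design choice, so that a missing entry is noticed as soon as a new table is added.

```python
# Explicit plural -> singular mapping, maintained alongside the schema.
SINGULARS = {
    "employees": "employee",
    "customers": "customer",
}


def entity_name(table_name: str) -> str:
    """Look up the singular entity name for a table, failing loudly if it is missing."""
    try:
        return SINGULARS[table_name]
    except KeyError:
        raise KeyError(
            f"no singular registered for table {table_name!r}; add it to SINGULARS"
        ) from None


print(entity_name("employees"))
print(entity_name("customers"))
```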
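The problem is that this is based on general rules, but English has (figuratively) a billion exceptions... How do you handle words like "fish" or "geese"?
Also, the rules describe how to turn a singular noun into a plural. The reverse mapping is not necessarily even possible (consider "freebies").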
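Andrew Peters has a class called Inflector.NET that provides plural-to-singular and singular-to-plural methods. As Tal points out, no algorithm is infallible, but this one covers a fair number of irregular English nouns.
Perhaps have a look at the source code of something like the Rails Inflector.
See also this answer, which suggests using Morpha (or studying the algorithm behind it).
If you know that the words you want to lemmatize are plural nouns, you can tag them with NNS to get more accurate output.
Example input: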
$ cat test.txt
Types_NNS
Pies_NNS
Trees_NNS
Buses_NNS
Radii_NNS
Communities_NNS
Sheep_NNS
Fish_NNS
Example output:
$ cat test.txt | ./morpha -c
Type
Pie
Tree
Bus
Radius
Community
Sheep
Fish
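As a refinement, you could use rules that generate multiple possibilities and then look the results up in a dictionary to rule out the impossible options.
For example, replace -ies with both -y and -ie. Pies becomes py and pie. Only one of them is in the dictionary, so choose that one.
Perhaps you could even find a dictionary that includes frequency information and pick the most common of the words you generated.
If you combine this with an ordered list of rules that covers some of the exceptions, you could probably get quite high accuracy.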
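The idea can be sketched in a few lines of Python. Everything here is illustrative: `WORD_FREQ` stands in for a real dictionary, its frequency numbers are invented, and the candidate rules are only the handful discussed in this thread.

```python
# Hypothetical dictionary with made-up relative frequencies.
WORD_FREQ = {"pie": 120, "community": 300, "type": 500, "bus": 200, "tree": 400, "box": 250}


def candidates(word: str) -> list[str]:
    """Generate every singular form the suffix rules could produce."""
    out = []
    if word.endswith("ies"):
        out += [word[:-3] + "y", word[:-3] + "ie"]   # pies -> py, pie
    if word.endswith("es"):
        out += [word[:-2], word[:-1]]                # types -> typ, type
    elif word.endswith("s"):
        out.append(word[:-1])
    return out


def singularize(word: str):
    """Keep only candidates found in the dictionary; prefer the most frequent one."""
    known = [c for c in candidates(word) if c in WORD_FREQ]
    if not known:
        return None  # nothing plausible: leave it for an exception list
    return max(known, key=WORD_FREQ.get)


for plural in ["pies", "types", "trees", "buses", "communities"]:
    print(plural, "->", singularize(plural))
```

Returning None for unknown words is a deliberate choice in this sketch, so that those cases can be routed to an ordered exception list as suggested above.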
Perhaps this is what you need; if you know how to work with PHP scripts, it works well. It can convert plural words to singular and singular words to plural.
class BaseInflector
{
/**
* @var array the rules for converting a word into its plural form.
* The keys are the regular expressions and the values are the corresponding replacements.
*/
public static $plurals = [
'/([nrlm]ese|deer|fish|sheep|measles|ois|pox|media)$/i' => '\1',
'/^(sea[- ]bass)$/i' => '\1',
'/(m)ove$/i' => '\1oves',
'/(f)oot$/i' => '\1eet',
'/(h)uman$/i' => '\1umans',
'/(s)tatus$/i' => '\1tatuses',
'/(s)taff$/i' => '\1taff',
'/(t)ooth$/i' => '\1eeth',
'/(quiz)$/i' => '\1zes',
'/^(ox)$/i' => '\1\2en',
'/([m|l])ouse$/i' => '\1ice',
'/(matr|vert|ind)(ix|ex)$/i' => '\1ices',
'/(x|ch|ss|sh)$/i' => '\1es',
'/([^aeiouy]|qu)y$/i' => '\1ies',
'/(hive)$/i' => '\1s',
'/(?:([^f])fe|([lr])f)$/i' => '\1\2ves',
'/sis$/i' => 'ses',
'/([ti])um$/i' => '\1a',
'/(p)erson$/i' => '\1eople',
'/(m)an$/i' => '\1en',
'/(c)hild$/i' => '\1hildren',
'/(buffal|tomat|potat|ech|her|vet)o$/i' => '\1oes',
'/(alumn|bacill|cact|foc|fung|nucle|radi|stimul|syllab|termin|vir)us$/i' => '\1i',
'/us$/i' => 'uses',
'/(alias)$/i' => '\1es',
'/(ax|cris|test)is$/i' => '\1es',
'/s$/' => 's',
'/^$/' => '',
'/$/' => 's',
];
/**
* @var array the rules for converting a word into its singular form.
* The keys are the regular expressions and the values are the corresponding replacements.
*/
public static $singulars = [
'/([nrlm]ese|deer|fish|sheep|measles|ois|pox|media|ss)$/i' => '\1',
'/^(sea[- ]bass)$/i' => '\1',
'/(s)tatuses$/i' => '\1tatus',
'/(f)eet$/i' => '\1oot',
'/(t)eeth$/i' => '\1ooth',
'/^(.*)(menu)s$/i' => '\1\2',
'/(quiz)zes$/i' => '\\1',
'/(matr)ices$/i' => '\1ix',
'/(vert|ind)ices$/i' => '\1ex',
'/^(ox)en/i' => '\1',
'/(alias)(es)*$/i' => '\1',
'/(alumn|bacill|cact|foc|fung|nucle|radi|stimul|syllab|termin|viri?)i$/i' => '\1us',
'/([ftw]ax)es/i' => '\1',
'/(cris|ax|test)es$/i' => '\1is',
'/(shoe|slave)s$/i' => '\1',
'/(o)es$/i' => '\1',
'/ouses$/' => 'ouse',
'/([^a])uses$/' => '\1us',
'/([m|l])ice$/i' => '\1ouse',
'/(x|ch|ss|sh)es$/i' => '\1',
'/(m)ovies$/i' => '\1\2ovie',
'/(s)eries$/i' => '\1\2eries',
'/([^aeiouy]|qu)ies$/i' => '\1y',
'/([lr])ves$/i' => '\1f',
'/(tive)s$/i' => '\1',
'/(hive)s$/i' => '\1',
'/(drive)s$/i' => '\1',
'/([^fo])ves$/i' => '\1fe',
'/(^analy)ses$/i' => '\1sis',
'/(analy|diagno|^ba|(p)arenthe|(p)rogno|(s)ynop|(t)he)ses$/i' => '\1\2sis',
'/([ti])a$/i' => '\1um',
'/(p)eople$/i' => '\1\2erson',
'/(m)en$/i' => '\1an',
'/(c)hildren$/i' => '\1\2hild',
'/(n)ews$/i' => '\1\2ews',
'/(n)etherlands$/i' => '\1\2etherlands',
'/eaus$/' => 'eau',
'/^(.*us)$/' => '\\1',
'/s$/i' => '',
];
/**
* @var array the special rules for converting a word between its plural form and singular form.
* The keys are the special words in singular form, and the values are the corresponding plural form.
*/
public static $specials = [
'atlas' => 'atlases',
'beef' => 'beefs',
'brother' => 'brothers',
'cafe' => 'cafes',
'child' => 'children',
'cookie' => 'cookies',
'corpus' => 'corpuses',
'cow' => 'cows',
'curve' => 'curves',
'foe' => 'foes',
'ganglion' => 'ganglions',
'genie' => 'genies',
'genus' => 'genera',
'graffito' => 'graffiti',
'hoof' => 'hoofs',
'loaf' => 'loaves',
'man' => 'men',
'money' => 'monies',
'mongoose' => 'mongooses',
'move' => 'moves',
'mythos' => 'mythoi',
'niche' => 'niches',
'numen' => 'numina',
'occiput' => 'occiputs',
'octopus' => 'octopuses',
'opus' => 'opuses',
'ox' => 'oxen',
'penis' => 'penises',
'sex' => 'sexes',
'soliloquy' => 'soliloquies',
'testis' => 'testes',
'trilby' => 'trilbys',
'turf' => 'turfs',
'wave' => 'waves',
'Amoyese' => 'Amoyese',
'bison' => 'bison',
'Borghese' => 'Borghese',
'bream' => 'bream',
'breeches' => 'breeches',
'britches' => 'britches',
'buffalo' => 'buffalo',
'cantus' => 'cantus',
'carp' => 'carp',
'chassis' => 'chassis',
'clippers' => 'clippers',
'cod' => 'cod',
'coitus' => 'coitus',
'Congoese' => 'Congoese',
'contretemps' => 'contretemps',
'corps' => 'corps',
'debris' => 'debris',
'diabetes' => 'diabetes',
'djinn' => 'djinn',
'eland' => 'eland',
'elk' => 'elk',
'equipment' => 'equipment',
'Faroese' => 'Faroese',
'flounder' => 'flounder',
'Foochowese' => 'Foochowese',
'gallows' => 'gallows',
'Genevese' => 'Genevese',
'Genoese' => 'Genoese',
'Gilbertese' => 'Gilbertese',
'graffiti' => 'graffiti',
'headquarters' => 'headquarters',
'herpes' => 'herpes',
'hijinks' => 'hijinks',
'Hottentotese' => 'Hottentotese',
'information' => 'information',
'innings' => 'innings',
'jackanapes' => 'jackanapes',
'Kiplingese' => 'Kiplingese',
'Kongoese' => 'Kongoese',
'Lucchese' => 'Lucchese',
'mackerel' => 'mackerel',
'Maltese' => 'Maltese',
'mews' => 'mews',
'moose' => 'moose',
'mumps' => 'mumps',
'Nankingese' => 'Nankingese',
'news' => 'news',
'nexus' => 'nexus',
'Niasese' => 'Niasese',
'Pekingese' => 'Pekingese',
'Piedmontese' => 'Piedmontese',
'pincers' => 'pincers',
'Pistoiese' => 'Pistoiese',
'pliers' => 'pliers',
'Portuguese' => 'Portuguese',
'proceedings' => 'proceedings',
'rabies' => 'rabies',
'rice' => 'rice',
'rhinoceros' => 'rhinoceros',
'salmon' => 'salmon',
'Sarawakese' => 'Sarawakese',
'scissors' => 'scissors',
'series' => 'series',
'Shavese' => 'Shavese',
'shears' => 'shears',
'siemens' => 'siemens',
'species' => 'species',
'swine' => 'swine',
'testes' => 'testes',
'trousers' => 'trousers',
'trout' => 'trout',
'tuna' => 'tuna',
'Vermontese' => 'Vermontese',
'Wenchowese' => 'Wenchowese',
'whiting' => 'whiting',
'wildebeest' => 'wildebeest',
'Yengeese' => 'Yengeese',
];
/**
* @var array fallback map for transliteration used by [[transliterate()]] when intl isn't available.
*/
public static $transliteration = [
'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C',
'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I', 'Ï' => 'I',
'Ð' => 'D', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O', 'Ö' => 'O', 'Ő' => 'O',
'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'U', 'Ű' => 'U', 'Ý' => 'Y', 'Þ' => 'TH',
'ß' => 'ss',
'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae', 'ç' => 'c',
'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i',
'ð' => 'd', 'ñ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o', 'ő' => 'o',
'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u', 'ű' => 'u', 'ý' => 'y', 'þ' => 'th',
'ÿ' => 'y',
];
/**
* Shortcut for `Any-Latin; NFKD` transliteration rule. The rule is strict, letters will be transliterated with
* the closest sound-representation chars. The result may contain any UTF-8 chars. For example:
* `获取到 どちら Українська: ґ,є, Српска: ђ, њ, џ! ¿Español?` will be transliterated to
* `huò qǔ dào dochira Ukraí̈nsʹka: g̀,ê, Srpska: đ, n̂, d̂! ¿Español?`
*
* Used in [[transliterate()]].
* For detailed information see [unicode normalization forms](http://unicode.org/reports/tr15/#Normalization_Forms_Table)
* @see http://unicode.org/reports/tr15/#Normalization_Forms_Table
* @see transliterate()
* @since 2.0.7
*/
const TRANSLITERATE_STRICT = 'Any-Latin; NFKD';
/**
* Shortcut for `Any-Latin; Latin-ASCII` transliteration rule. The rule is medium, letters will be
* transliterated to characters of Latin-1 (ISO 8859-1) ASCII table. For example:
* `获取到 どちら Українська: ґ,є, Српска: ђ, њ, џ! ¿Español?` will be transliterated to
* `huo qu dao dochira Ukrainsʹka: g,e, Srpska: d, n, d! ¿Espanol?`
*
* Used in [[transliterate()]].
* For detailed information see [unicode normalization forms](http://unicode.org/reports/tr15/#Normalization_Forms_Table)
* @see http://unicode.org/reports/tr15/#Normalization_Forms_Table
* @see transliterate()
* @since 2.0.7
*/
const TRANSLITERATE_MEDIUM = 'Any-Latin; Latin-ASCII';
/**
* Shortcut for `Any-Latin; Latin-ASCII; [\u0080-\uffff] remove` transliteration rule. The rule is loose,
* letters will be transliterated with the characters of Basic Latin Unicode Block.
* For example:
* `获取到 どちら Українська: ґ,є, Српска: ђ, њ, џ! ¿Español?` will be transliterated to
* `huo qu dao dochira Ukrainska: g,e, Srpska: d, n, d! Espanol?`
*
* Used in [[transliterate()]].
* For detailed information see [unicode normalization forms](http://unicode.org/reports/tr15/#Normalization_Forms_Table)
* @see http://unicode.org/reports/tr15/#Normalization_Forms_Table
* @see transliterate()
* @since 2.0.7
*/
const TRANSLITERATE_LOOSE = 'Any-Latin; Latin-ASCII; [\u0080-\uffff] remove';
/**
* @var mixed Either a [[\Transliterator]], or a string from which a [[\Transliterator]] can be built
* for transliteration. Used by [[transliterate()]] when intl is available. Defaults to [[TRANSLITERATE_LOOSE]]
* @see http://php.net/manual/en/transliterator.transliterate.php
*/
public static $transliterator = self::TRANSLITERATE_LOOSE;
/**
* Converts a word to its plural form.
* Note that this is for English only!
* For example, 'apple' will become 'apples', and 'child' will become 'children'.
* @param string $word the word to be pluralized
* @return string the pluralized word
*/
public static function pluralize($word)
{
if (isset(static::$specials[$word])) {
return static::$specials[$word];
}
foreach (static::$plurals as $rule => $replacement) {
if (preg_match($rule, $word)) {
return preg_replace($rule, $replacement, $word);
}
}
return $word;
}
/**
* Returns the singular of the $word
* @param string $word the english word to singularize
* @return string Singular noun.
*/
public static function singularize($word)
{
$result = array_search($word, static::$specials, true);
if ($result !== false) {
return $result;
}
foreach (static::$singulars as $rule => $replacement) {
if (preg_match($rule, $word)) {
return preg_replace($rule, $replacement, $word);
}
}
return $word;
}
/**
* Converts an underscored or CamelCase word into a English
* sentence.
* @param string $words
* @param boolean $ucAll whether to set all words to uppercase
* @return string
*/
public static function titleize($words, $ucAll = false)
{
$words = static::humanize(static::underscore($words), $ucAll);
return $ucAll ? ucwords($words) : ucfirst($words);
}
/**
* Returns given word as CamelCased
* Converts a word like "send_email" to "SendEmail". It
* will remove non alphanumeric character from the word, so
* "who's online" will be converted to "WhoSOnline"
* @see variablize()
* @param string $word the word to CamelCase
* @return string
*/
public static function camelize($word)
{
return str_replace(' ', '', ucwords(preg_replace('/[^A-Za-z0-9]+/', ' ', $word)));
}
/**
* Converts a CamelCase name into space-separated words.
* For example, 'PostTag' will be converted to 'Post Tag'.
* @param string $name the string to be converted
* @param boolean $ucwords whether to capitalize the first letter in each word
* @return string the resulting words
*/
public static function camel2words($name, $ucwords = true)
{
$label = trim(strtolower(str_replace([
'-',
'_',
'.'
], ' ', preg_replace('/(?<![A-Z])[A-Z]/', ' \0', $name))));
return $ucwords ? ucwords($label) : $label;
}
/**
* Converts a CamelCase name into an ID in lowercase.
* Words in the ID may be concatenated using the specified character (defaults to '-').
* For example, 'PostTag' will be converted to 'post-tag'.
* @param string $name the string to be converted
* @param string $separator the character used to concatenate the words in the ID
* @param boolean|string $strict whether to insert a separator between two consecutive uppercase chars, defaults to false
* @return string the resulting ID
*/
public static function camel2id($name, $separator = '-', $strict = false)
{
$regex = $strict ? '/[A-Z]/' : '/(?<![A-Z])[A-Z]/';
if ($separator === '_') {
return trim(strtolower(preg_replace($regex, '_\0', $name)), '_');
} else {
return trim(strtolower(str_replace('_', $separator, preg_replace($regex, $separator . '\0', $name))), $separator);
}
}
/**
* Converts an ID into a CamelCase name.
* Words in the ID separated by `$separator` (defaults to '-') will be concatenated into a CamelCase name.
* For example, 'post-tag' is converted to 'PostTag'.
* @param string $id the ID to be converted
* @param string $separator the character used to separate the words in the ID
* @return string the resulting CamelCase name
*/
public static function id2camel($id, $separator = '-')
{
return str_replace(' ', '', ucwords(implode(' ', explode($separator, $id))));
}
/**
* Converts any "CamelCased" into an "underscored_word".
* @param string $words the word(s) to underscore
* @return string
*/
public static function underscore($words)
{
return strtolower(preg_replace('/(?<=\\w)([A-Z])/', '_\\1', $words));
}
/**
* Returns a human-readable string from $word
* @param string $word the string to humanize
* @param boolean $ucAll whether to set all words to uppercase or not
* @return string
*/
public static function humanize($word, $ucAll = false)
{
$word = str_replace('_', ' ', preg_replace('/_id$/', '', $word));
return $ucAll ? ucwords($word) : ucfirst($word);
}
/**
* Same as camelize but first char is in lowercase.
* Converts a word like "send_email" to "sendEmail". It
* will remove non alphanumeric character from the word, so
* "who's online" will be converted to "whoSOnline"
* @param string $word to lowerCamelCase
* @return string
*/
public static function variablize($word)
{
$word = static::camelize($word);
return strtolower($word[0]) . substr($word, 1);
}
/**
* Converts a class name to its table name (pluralized)
* naming conventions. For example, converts "Person" to "people"
* @param string $className the class name for getting related table_name
* @return string
*/
public static function tableize($className)
{
return static::pluralize(static::underscore($className));
}
/**
* Returns a string with all spaces converted to given replacement,
* non word characters removed and the rest of characters transliterated.
*
* If intl extension isn't available uses fallback that converts latin characters only
* and removes the rest. You may customize characters map via $transliteration property
* of the helper.
*
* @param string $string An arbitrary string to convert
* @param string $replacement The replacement to use for spaces
* @param boolean $lowercase whether to return the string in lowercase or not. Defaults to `true`.
* @return string The converted string.
*/
public static function slug($string, $replacement = '-', $lowercase = true)
{
$string = static::transliterate($string);
$string = preg_replace('/[^a-zA-Z0-9=\s—–-]+/u', '', $string);
$string = preg_replace('/[=\s—–-]+/u', $replacement, $string);
$string = trim($string, $replacement);
return $lowercase ? strtolower($string) : $string;
}
/**
* Returns transliterated version of a string.
*
* If intl extension isn't available uses fallback that converts latin characters only
* and removes the rest. You may customize characters map via $transliteration property
* of the helper.
*
* @param string $string input string
* @param string|\Transliterator $transliterator either a [[Transliterator]] or a string
* from which a [[Transliterator]] can be built.
* @return string
* @since 2.0.7 this method is public.
*/
public static function transliterate($string, $transliterator = null)
{
if (static::hasIntl()) {
if ($transliterator === null) {
$transliterator = static::$transliterator;
}
return transliterator_transliterate($transliterator, $string);
} else {
return strtr($string, static::$transliteration);
}
}
/**
* @return boolean if intl extension is loaded
*/
protected static function hasIntl()
{
return extension_loaded('intl');
}
/**
* Converts a table name to its class name. For example, converts "people" to "Person"
* @param string $tableName
* @return string
*/
public static function classify($tableName)
{
return static::camelize(static::singularize($tableName));
}
/**
* Converts number to its ordinal English form. For example, converts 13 to 13th, 2 to 2nd ...
* @param integer $number the number to get its ordinal value
* @return string
*/
public static function ordinalize($number)
{
if (in_array($number % 100, range(11, 13))) {
return $number . 'th';
}
switch ($number % 10) {
case 1:
return $number . 'st';
case 2:
return $number . 'nd';
case 3:
return $number . 'rd';
default:
return $number . 'th';
}
}
/**
* Converts a list of words into a sentence.
*
* Special treatment is done for the last few words. For example,
*
* ```php
* $words = ['Spain', 'France'];
* echo Inflector::sentence($words);
* // output: Spain and France
*
* $words = ['Spain', 'France', 'Italy'];
* echo Inflector::sentence($words);
* // output: Spain, France and Italy
*
* $words = ['Spain', 'France', 'Italy'];
* echo Inflector::sentence($words, ' & ');
* // output: Spain, France & Italy
* ```
*
* @param array $words the words to be converted into an string
* @param string $twoWordsConnector the string connecting words when there are only two
* @param string $lastWordConnector the string connecting the last two words. If this is null, it will
* take the value of `$twoWordsConnector`.
* @param string $connector the string connecting words other than those connected by
* $lastWordConnector and $twoWordsConnector
* @return string the generated sentence
* @since 2.0.1
*/
public static function sentence(array $words, $twoWordsConnector = ' and ', $lastWordConnector = null, $connector = ', ')
{
if ($lastWordConnector === null) {
$lastWordConnector = $twoWordsConnector;
}
switch (count($words)) {
case 0:
return '';
case 1:
return reset($words);
case 2:
return implode($twoWordsConnector, $words);
default:
return implode($connector, array_slice($words, 0, -1)) . $lastWordConnector . end($words);
}
}
}
Here are some examples.
echo "Inflector Test";
require('PhInflector.php');
echo "<hr>";
echo PhInflector::slug('Höäpeäöäich Médsui27:;;,.1! *"29p');
echo "<hr>";
echo PhInflector::slug('HIJO"$(/&T §!"(/&T"§:;;,.1! *"29p');
echo "<hr>";
echo PhInflector::slug('38917 jiodj d ! *"29p');
echo "<hr>";
echo PhInflector::slug('каи циефле ///!!!');
The GitHub link is here.
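I think you will have to use a list to convert the plurals of some special words to the singular (as in your example, Types->Type).
I think you could take a look at CakePHP's source code (you can start your search from here). They use this kind of algorithm on their table names and field names to join tables automatically.
[Edit:] Here you can read some scientific work on "plural inflection in English".
I'm sure you can find plenty of libraries that do this via Google.
But if you like coding, you could try the reverse process: start from the singular words of a dictionary (download a free one, such as those used by aspell or similar) and apply the pluralization rules; collect the mapping and flip its direction. For "type", you would pluralize to "types", and the reverse mapping would work as expected. There are exceptions here too, but pluralizing things reliably is somewhat easier. I did this a while ago (mid-90s... :-)) for an online game (a MUD), where descriptions of multiple identical items were joined together and needed to be pluralized automatically.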
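A small sketch of that reverse process, assuming Python: the `pluralize` function below implements only a few regular rules, and `DICTIONARY` is a tiny stand-in for a real word list such as the ones shipped with aspell.

```python
VOWELS = set("aeiou")


def pluralize(word: str) -> str:
    """A few regular English pluralization rules (far from complete)."""
    if len(word) > 1 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "ies"                      # community -> communities
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"                            # bus -> buses, box -> boxes
    return word + "s"                                 # type -> types


def build_reverse_map(singulars):
    """Pluralize every dictionary word and record plural -> singular."""
    reverse = {}
    for singular in singulars:
        # setdefault keeps the first singular seen if two words share a plural
        reverse.setdefault(pluralize(singular), singular)
    return reverse


# Stand-in for a real dictionary of singular nouns.
DICTIONARY = ["type", "pie", "tree", "bus", "community", "box"]
PLURAL_TO_SINGULAR = build_reverse_map(DICTIONARY)

print(PLURAL_TO_SINGULAR["types"])
print(PLURAL_TO_SINGULAR["pies"])
```

Because the lookup starts from known singular words, Types can never come back as Typ; a word missing from the dictionary simply has no entry.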
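Also: given the limited number of tables, you could use the simplest algorithm, take the raw output, look it over, and fix the wrong cases by hand. :-)
I'm going to try MorphAdorner: http://morphadorner.northwestern.edu/morphadorner/download/ (Java). It is a collection of NLP tools of various kinds, and you can test it through the online examples. For your problem (which is also mine), there is the Pluralizer tool: http://morphadorner.northwestern.edu/morphadorner/pluralizer/example/
I just ran into this problem and developed a solution in 10 minutes.
I think @paxdiablo offers a good line of thinking: build a transformation engine and add rules to it. I built one dictionary rule and three general rules. The dictionary rule consults a dictionary file to look up the exceptional cases, while the three general rules handle "ies", "es", and "s" respectively.
However, adding every exception to the dictionary could take too much time, e.g. pies/trees/buses, etc. One improvement I made for handling these words is to make sure the result can be converted back.
For example, if we mistakenly apply the remove-"es" rule to "trees" and convert it to "tre", then when you try to add the plural ending back you get "tres", which does not equal the original "trees", so you know the "es" rule should not be applied. This approach resolves the exceptions mentioned above without adding them to the dictionary file.
I ended up with a dictionary file of 42 truly special words, and it handles most cases.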
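The round-trip check might be sketched like this in Python. The pluralizer and the `EXCEPTIONS` entries are assumptions made for the example, not the actual dictionary file of 42 words. With this particular simple pluralizer, "pies" still round-trips through "py", so in this sketch it stays in the exception dictionary.

```python
VOWELS = set("aeiou")

# Stand-in for the dictionary file of truly special words.
EXCEPTIONS = {"radii": "radius", "sheep": "sheep", "pies": "pie"}


def pluralize(word: str) -> str:
    """Simple regular pluralization used only to verify a candidate."""
    if len(word) > 1 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "ies"
    if word.endswith(("s", "x", "z", "ch", "sh")):
        return word + "es"
    return word + "s"


def singularize(word: str) -> str:
    """Dictionary rule first, then the three general rules with a round-trip check."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in (("ies", "y"), ("es", ""), ("s", "")):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Accept the rule only if pluralizing the result gives the word back.
            if pluralize(candidate) == word:
                return candidate
    return word


for plural in ["trees", "types", "buses", "communities", "pies"]:
    print(plural, "->", singularize(plural))
```

For "trees", the "es" rule yields "tre", which pluralizes to "tres" and is rejected; the "s" rule then yields "tree", which pluralizes back to "trees" and is accepted.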
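There is a nice inflector implementation in the uNhAddIns project; it even implements an experimental Spanish inflector. The idea comes from the Rails Inflector module.
It can also be used for other things, such as converting from CamelCase to plain text, and other goodies like generating browser-friendly URLs from titles.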