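In pass 1 of 2, start with the regular expression /(\w+)|(\W+)/g to break the text into alternating runs of word and non-word characters.
Store the matches in a list like this (I'm using AS3-style pseudocode, since I don't work with Ruby):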
class MatchedWord {
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean ) {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null) {
    //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) );
}
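For anyone who can't run AS3, here is a rough Python sketch of the same first pass. It is a port under assumptions rather than the code above: the names `split_words` and `WORDS_REGEX` are mine, and Python's `\w` is Unicode-aware by default, so it may treat accented letters differently from the AS3 regex.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MatchedWord:
    text: str
    char_index: int
    is_word: bool
    is_contraction: bool = False


# Group 1 captures a run of word characters, group 2 a run of non-word characters.
WORDS_REGEX = re.compile(r"(\w+)|(\W+)")


def split_words(original_text: str) -> List[MatchedWord]:
    """Split text into alternating word / non-word blocks (pass 1 of 2)."""
    return [
        # group(1) is None whenever the non-word alternative matched instead.
        MatchedWord(m.group(0), m.start(), m.group(1) is not None)
        for m in WORDS_REGEX.finditer(original_text)
    ]


tokens = split_words("I'd say g'day")
print([(t.text, t.char_index, t.is_word) for t in tokens])
```

Note that the apostrophe always lands in its own non-word block at this stage, and that joining the block texts back together reproduces the input exactly, which is what makes the second pass a simple matter of merging neighbours.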
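In pass 2 of 2, iterate over the list of matches to find contractions by checking whether each (trimmed, non-word) match ENDS with an apostrophe. If it does, check the next adjacent (word) match to see whether it is one of only 8 common contraction endings. Of all the two-part contractions I could think of, there are only 8 common endings.
Once you've identified such a pair of matches, e.g. (non-word)="'" and (word)="d", you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.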
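Contractions that begin with an apostrophe, such as "'twas" and "'tis", need the opposite treatment: the ending is merged with the apostrophe only, and the preceding (word) match is left out. Another modification you may need is to extend the list of 8 endings to cover whole-word endings such as "g'day" and "g'night". Again, this is a simple change involving a conditional check on the preceding (word) match: if it is "g", you include it.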
Condition(Ending, PreCondition)
where PreCondition is "*", "!", or "<exact string>"
new Condition("d","*"); //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
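To make the three kinds of precondition concrete, here is a small Python sketch of the lookup on its own. The function name `classify` and its return values "include" / "exclude" are invented for illustration; only the ending/precondition pairs come from the list above.

```python
from typing import Optional

# ending -> precondition, taken from the Condition list above
CONDITIONS = {
    "d": "*", "l": "*", "ll": "*", "m": "*",
    "re": "*", "s": "*", "t": "*", "ve": "*",
    "twas": "!", "tis": "!",
    "day": "g", "night": "g",
}


def classify(prev_word: Optional[str], ending: str) -> Optional[str]:
    """Decide what to do with the word that follows an apostrophe.

    Returns "include" (merge the preceding word too), "exclude" (merge only
    the apostrophe and the ending), or None (not a known contraction).
    """
    pre = CONDITIONS.get(ending)
    if pre is None:
        return None
    if pre == "*":  # any preceding word will do, but there must be one
        return "include" if prev_word is not None else None
    if pre == "!":  # the preceding word is never part of the contraction
        return "exclude"
    # otherwise the preceding word must equal the precondition exactly
    return "include" if prev_word == pre else None


print(classify("I", "d"), classify("It", "twas"), classify("g", "day"), classify("to", "day"))
```

Keeping the table as plain data means that adding another ending later is a one-line change and the matching logic stays untouched.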
If you process those conditions just as I explained, that should cover all 86 of these contractions (and more):
aren't can't couldn't everybody's g'day he's how'd I'd I'll isn't she'd she'll should've shouldn't that'd what'd why'd won't ...
As a side note, don't forget slang contractions that don't use an apostrophe, such as "gotta" > "got to" and "gonna" > "going to".
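Those apostrophe-free forms can't be found by the apostrophe scan, so one option is a plain lookup table. The Python sketch below is only an illustration: `SLANG` and `expand_slang` are hypothetical names, the table holds just the two examples mentioned above, and the replacement does not try to preserve capitalization.

```python
import re

# Only the two examples from the text; extend as needed.
SLANG = {"gotta": "got to", "gonna": "going to"}

# \b keeps us from matching inside longer words.
SLANG_REGEX = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in SLANG) + r")\b",
    re.IGNORECASE,
)


def expand_slang(text: str) -> str:
    """Replace known slang contractions with their expanded forms."""
    return SLANG_REGEX.sub(lambda m: SLANG[m.group(0).lower()], text)


print(expand_slang("I gotta go, I'm gonna be late"))
```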
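Here is the final AS3 code. Overall, you need fewer than 50 lines of code to parse the text into alternating word and non-word groups and to identify and merge the contractions. Simple. You can even add a Boolean "isContraction" variable to the MatchedWord class and set that flag in the code below whenever a contraction is identified.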
//Automatically merge known contractions
//StringUtils is assumed to be a helper class that provides trim() and endsWith()
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"], ["ll","*"], ["m","*"], ["re","*"], ["s","*"], ["t","*"], ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) { //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" )) {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions) { //endings are unique, so at most one condition fires per apostrophe
            if (StringUtils.trim( m_next.text ) == condition[0]) {
                var pre_condition:String = condition[1];
                switch (pre_condition) {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && m.text == "'") { //EQUAL apostrophe (untrimmed), not just ENDS with apostrophe
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //the match after the contraction slid into index i, so look at it next
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (m.text == "'") {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        } else if (StringUtils.endsWith( m.text, "'" )) { //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length ); //strip the apostrophe
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
                        if (m_prev != null && m.text == "'" && m_prev.text == pre_condition) {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //same index adjustment as in the "*" case
                        }
                        break;
                }
            }
        }
    }
}
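If you'd like to try the whole thing without an AS3 toolchain, the following is a self-contained Python sketch of both passes. Treat it as my own port under assumptions, not a line-for-line translation: `split_words` and `merge_contractions` are made-up names, the comparison is case-sensitive (so "'Twas" is not caught), and an apostrophe must sit directly against the ending with no space in between.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class MatchedWord:
    text: str
    char_index: int
    is_word: bool
    is_contraction: bool = False


CONDITIONS = {
    "d": "*", "l": "*", "ll": "*", "m": "*",
    "re": "*", "s": "*", "t": "*", "ve": "*",
    "twas": "!", "tis": "!",
    "day": "g", "night": "g",
}


def split_words(text: str) -> List[MatchedWord]:
    """Pass 1: alternating word / non-word blocks."""
    return [MatchedWord(m.group(0), m.start(), m.group(1) is not None)
            for m in re.finditer(r"(\w+)|(\W+)", text)]


def merge_contractions(words: List[MatchedWord]) -> List[MatchedWord]:
    """Pass 2: merge apostrophe + ending (and maybe the prior word) in place."""
    i = 0
    while i < len(words) - 1:
        m, m_next = words[i], words[i + 1]
        m_prev = words[i - 1] if i >= 1 else None
        pre = CONDITIONS.get(m_next.text)
        if m.is_word or pre is None or not m.text.endswith("'"):
            i += 1
        elif pre == "!":
            if m.text == "'":
                # The block is just the apostrophe: absorb the ending into it.
                m.text += m_next.text
                m.is_word = True
                m.is_contraction = True
                del words[i + 1]
            else:
                # Move the trailing apostrophe onto the next block instead.
                cut = m.text.rfind("'")
                m_next.text = m.text[cut:] + m_next.text
                m_next.char_index = m.char_index + cut
                m_next.is_contraction = True
                m.text = m.text[:cut]
            i += 1
        elif m_prev is not None and m.text == "'" and pre in ("*", m_prev.text):
            # Prior word + apostrophe + ending become one block; stay at i
            # because the following block has slid into this position.
            m_prev.text += m.text + m_next.text
            m_prev.is_contraction = True
            del words[i:i + 2]
        else:
            i += 1
    return words


merged = merge_contractions(split_words("I'd say 'tis late, g'day"))
print([w.text for w in merged if w.is_contraction])
```

Because blocks are only ever merged or re-split, joining the texts of the result still reproduces the original string, and each `char_index` still points at where its block starts in that string.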