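I'm trying to parse a block of text and need a way to tell apart apostrophes used in different contexts: possession and contraction in one group, quotation in the other.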

e.g.

"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]

"He said 'hello there'" -> ["He", "said", "'hello there'"]

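Detecting whitespace on either side won't help, because things like "'ello" and "cars'" would parse as one end of a quotation, just like a matching pair of apostrophes. I get the feeling there is no way to do this short of some outrageously complicated NLP solution, and that I'll just have to ignore any apostrophe that doesn't occur mid-word, which would be unfortunate.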

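EDIT:

Since writing this I have realised it is impossible. Any regex-based parser would have to parse:

'ello there my mates' dog

in two different ways, and could only choose between them by understanding the rest of the sentence. I guess I'm settling for the inelegant solution of ignoring the least likely case and hoping it's rare enough to cause only infrequent anomalies.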

3 Answers

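Use a very simple two-pass process.

In pass 1 of 2, start with this regular expression to break the text into alternating segments of word and non-word characters.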

/(\w+)|(\W+)/gi

Store the matches in a list like this (I'm using AS3-style pseudocode, since I don't work with Ruby):

class MatchedWord
{
    var text:String;
    var charIndex:int;
    var isWord:Boolean;
    var isContraction:Boolean = false;
    function MatchedWord( text:String, charIndex:int, isWord:Boolean )
    {
        this.text = text; this.charIndex = charIndex; this.isWord = isWord;
    }
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi;
words_regex.lastIndex = 0; //where to start looking for matches; updated to the end of the last match each time exec is called
//match[0] is the entire match and match[1] is the first parenthetical group
//(if match[1] is null, then it's not a word and match[2] would be non-null)
while ((match = words_regex.exec( original_text )) != null)
    matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) );
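Since the question is about Ruby, here is a rough sketch of what pass 1 might look like there. The names `Token` and `tokenize` are made up for illustration and only mirror the `MatchedWord` class above; treat it as one possible port, not a tested equivalent of the AS3 code.

```ruby
# Hypothetical Ruby counterpart of MatchedWord (names are assumptions).
Token = Struct.new(:text, :char_index, :is_word, :is_contraction)

# Pass 1: split the text into alternating runs of word (\w+) and
# non-word (\W+) characters, remembering where each run starts.
def tokenize(text)
  tokens = []
  text.scan(/(\w+)|(\W+)/) do
    m = Regexp.last_match
    # m[1] is set only when the word group matched
    tokens << Token.new(m[0], m.begin(0), !m[1].nil?, false)
  end
  tokens
end

tokenize("I'd say 'twas fine").each do |t|
  puts "#{t.char_index}\t#{t.is_word}\t#{t.text.inspect}"
end
```

Note that the apostrophe in "I'd" comes out as its own non-word run, while the one in "'twas" is glued to the preceding space. That difference is what pass 2 has to deal with.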

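In pass 2 of 2, iterate over the list of matches and look for contractions by checking whether each (trimmed, non-word) match ends with an apostrophe. If it does, check the next adjacent (word) match to see whether it is one of the common contraction endings. Of all the two-part contractions I could think of, there are only 8 common endings.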

d
l
ll
m
re
s
t
ve

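Once you've identified such a pair of matches, (non-word)="'" and (word)="d", you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.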
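A minimal Ruby sketch of that three-way merge, limited to the 8 endings listed above. The helper names `split_runs` and `merge_simple_contractions` are hypothetical, and as a simplification the sketch only merges when the non-word run is exactly an apostrophe, without trimming first.

```ruby
ENDINGS = %w[d l ll m re s t ve].freeze

# Pass 1: alternating word / non-word runs as plain strings.
def split_runs(text)
  text.scan(/\w+|\W+/)
end

# Pass 2: join word + "'" + ending into a single item.
def merge_simple_contractions(runs)
  out = []
  i = 0
  while i < runs.length
    if i + 2 < runs.length &&
       runs[i].match?(/\A\w+\z/) &&      # preceding (word) match
       runs[i + 1] == "'" &&             # run is exactly an apostrophe
       ENDINGS.include?(runs[i + 2])     # one of the 8 endings
      out << runs[i] + runs[i + 1] + runs[i + 2]
      i += 3
    else
      out << runs[i]
      i += 1
    end
  end
  out
end

p merge_simple_contractions(split_runs("I'm sure they've gone"))
p merge_simple_contractions(split_runs("the cars' owner"))
```

In the second call nothing is merged, because the run after "cars" is an apostrophe plus a space and "owner" is not a contraction ending, so the plural possessive is left for later handling.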

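With that process in mind, one modification you must make is to extend the list of contraction endings to cover contractions that start with an apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look at the apostrophe match more closely to see whether it contains other non-word characters before the apostrophe (that's why it matters that it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, merge it with the next match; if it only ENDS with an apostrophe, strip the apostrophe off and merge that with the following match. Likewise, conditions that include the prior match should first check that the (trimmed, non-word) match EQUALS an apostrophe, rather than merely ending with one.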

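Another modification you may need to make is to extend the list of 8 endings to include endings that are whole words, as in "g'day" and "g'night". Again, it's a simple modification involving a conditional check of the preceding (word) match: if it's "g", you include it.

That process should capture the majority of contractions, and it is flexible enough to accommodate new ones as you think of them.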

The data structure would look like this.

Condition(Ending, PreCondition)

where PreCondition is

"*", "!", or "<exact string>"

The final list of conditions would look like this:

new Condition("d","*"); //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
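For a Ruby reader, the condition table and its three precondition kinds could be sketched roughly as below. `Condition`, `CONDITIONS` and `merge_contractions` are illustrative names, not anything from a library. One deliberate simplification: for the "!" case the sketch always splits the non-word run at the apostrophe, so leading spaces and punctuation stay in their own run.

```ruby
Condition = Struct.new(:ending, :precondition)

CONDITIONS = [
  *%w[d l ll m re s t ve].map { |e| Condition.new(e, "*") }, # include any preceding word
  Condition.new("twas", "!"), Condition.new("tis", "!"),      # never include preceding word
  Condition.new("day", "g"), Condition.new("night", "g")      # include only if preceding word is "g"
].freeze

def merge_contractions(text)
  runs = text.scan(/\w+|\W+/)
  out = []
  i = 0
  while i < runs.length
    run = runs[i]
    nxt = runs[i + 1]
    cond = (nxt && run.end_with?("'")) ? CONDITIONS.find { |c| c.ending == nxt } : nil
    bare = run == "'"                                   # EQUALS an apostrophe, not just ends with one
    prev_word = !out.empty? && out.last.match?(/\w\z/)  # last emitted item is a word
    if cond.nil?
      out << run
      i += 1
    elsif cond.precondition == "!"
      out << run[0...-1] unless bare                    # keep any leading spaces/punctuation
      out << "'" + nxt
      i += 2
    elsif bare && prev_word && (cond.precondition == "*" || out.last == cond.precondition)
      out[-1] = out.last + run + nxt                    # merge into the preceding word
      i += 2
    else
      out << run                                        # precondition not met, leave as-is
      i += 1
    end
  end
  out
end

p merge_contractions("g'day mate, 'twas cold and I'd left")
```

Because each merged contraction becomes the last emitted item, a following ending can attach to it as well, so something like "I'd've" ends up as a single item under these assumptions.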

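If you just process those conditions as I explained, that should cover all of these 86 contractions (and more):

ain't aren't can't … he's how'd … I'd I'll I'm I've … she'd she'll … should've shouldn't that'll … what'd … why'd why'll won't …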

On a side note, don't forget about slang contractions that don't use apostrophes, such as "gotta" > "got to" and "gonna" > "going to".
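If you want to handle those too, a plain lookup table is probably enough. The sketch below only contains the two examples just mentioned; `SLANG` and `expand_slang` are hypothetical names, and the replacement does not try to preserve capitalization.

```ruby
SLANG = { "gotta" => "got to", "gonna" => "going to" }.freeze

# Replace each whole word that appears in the table; leave everything else alone.
def expand_slang(text)
  text.gsub(/\b\w+\b/) { |word| SLANG.fetch(word.downcase, word) }
end

puts expand_slang("I'm gonna go, you gotta stay")
```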

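Here is the final AS3 code. Overall, you're looking at fewer than 50 lines of code to parse the text into alternating groups of word and non-word characters and to identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below whenever a contraction is identified.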

//Automatically merge known contractions
var conditions:Array = [
    ["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
    ["l","*"],
    ["ll","*"],
    ["m","*"],
    ["re","*"],
    ["s","*"],
    ["t","*"],
    ["ve","*"],
    ["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
    ["tis","!"],
    ["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
    ["night","g"]
];
for (var i:int = 0; i < matched_words.length - 1; i++) //not a typo, intentionally stopping at next to last index to avoid a condition check in the loop
{
    var m:MatchedWord = matched_words[i];
    var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
    if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
    {
        var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
        var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
        for each (var condition:Array in conditions)
        {
            if (StringUtils.trim( m_next.text ) == condition[0])
            {
                var pre_condition:String = condition[1];
                switch (pre_condition)
                {
                    case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
                        if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
                        {
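                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //the match that followed the contraction now sits at index i; step back so i++ does not skip it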
                        }
                        break;
                    case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
                        if (apostrophe_text == "'")
                        {
                            m.text += m_next.text;
                            m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
                            m.isContraction = true;
                            matched_words.splice( i + 1, 1 );
                        }
                        else
                        {   //strip apostrophe off end and merge with next item, nothing needs deleted
                            //preserve spaces and match start indexes by manipulating untrimmed strings
                            var apostrophe_end:int = m.text.lastIndexOf( "'" );
                            var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
                            m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
                            m_next.text = apostrophe_ending + m_next.text;
                            m_next.charIndex = m.charIndex + apostrophe_end;
                            m_next.isContraction = true;
                        }
                        break;
                    default: //conditional success, check prior match meets condition
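                        if (m_prev != null && apostrophe_text == "'" && m_prev.text == pre_condition) //EQUAL apostrophe here too, since the prior match is included
                        {
                            m_prev.text += m.text + m_next.text;
                            m_prev.isContraction = true;
                            matched_words.splice( i, 2 );
                            i--; //the match that followed the contraction now sits at index i; step back so i++ does not skip it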
                        }
                        break;
                }
            }
        }
    }
}
answered 2012-10-17T17:58:29.900

Hm, I'm afraid this won't be easy. Here's a regex that kind of works, though alas it only copes with contractions like "I'm" and "I've":

>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"

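If you play around with it a bit more you may be able to rule out some other common contractions. It's not perfect, but it's better than nothing.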
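To make the limitation concrete, here is a small script using the same regex. The answer never shows what s1 and s2 were, so the three sample strings below are my own assumptions: the first two are chosen to give the same kind of results as the session above, and the third contains a contraction that does not start with "I".

```ruby
QUOTE_RE = /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/

samples = [
  "I'm the cars' owner",    # assumed input: no quotation
  "He said 'hello there'",  # assumed input: one quotation
  "don't say 'hi'"          # contraction not starting with "I"
]

samples.each do |s|
  m = QUOTE_RE.match(s)
  puts "#{s.inspect} -> #{m ? m[1].inspect : 'no quote found'}"
end
```

For the third string the capture comes out as "'t say '", because the apostrophe in "don't" is not preceded by "I" and so gets taken as an opening quote. Widening the lookbehind helps with individual cases, but the approach stays heuristic.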

answered 2012-05-09T21:52:11.613

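Some rules to consider:

  • A quote starts with an apostrophe that has a whitespace character, or nothing at all, before it.
  • A quote ends with an apostrophe that has punctuation or a whitespace character after it.
  • Some words can look like the end of a quote, e.g. peoples'
  • Quote-delimiting apostrophes never have letters directly before AND after them.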
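These rules can be turned into a rough classifier that labels each apostrophe by its neighbours. The sketch below is only one reading of them: `classify_apostrophes` and the label names are made up, and treating end-of-string as a valid closing position is my own assumption.

```ruby
# Label every apostrophe in the text according to the characters around it.
def classify_apostrophes(text)
  result = []
  text.each_char.with_index do |ch, i|
    next unless ch == "'"
    before = i.zero? ? nil : text[i - 1]
    after = text[i + 1]                     # nil at end of string
    letter_before = !before.nil? && before.match?(/[[:alpha:]]/)
    letter_after = !after.nil? && after.match?(/[[:alpha:]]/)
    kind =
      if letter_before && letter_after
        :inside_word      # letters on both sides: never a quote delimiter
      elsif before.nil? || before.match?(/\s/)
        :possible_open    # whitespace or nothing before it
      elsif after.nil? || after.match?(/[\s[:punct:]]/)
        :possible_close   # could equally be a plural possessive like peoples'
      else
        :unknown
      end
    result << [i, kind]
  end
  result
end

p classify_apostrophes("I'm here")
p classify_apostrophes("'ello there my mates' dog")
```

The second call labels the two apostrophes as a possible opening and a possible closing quote, which is exactly the ambiguity from the question's edit: the rules narrow things down but cannot settle that case on their own.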
answered 2012-05-10T04:35:24.073