2

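Basically, I'm taking in a paragraph full of all sorts of punctuation, such as ! ? . ; " and splitting it into sentences. The problem I'm facing is coming up with a way to split it into complete sentences with their punctuation intact, while also accounting for quotations in dialogue.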

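For example, the following paragraph:

One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don't know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.

would need to be split like this: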

[0] One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. 

[1] "What has happened!?" he asked himself.

[2] "I... don't know." said Samsa, "Maybe this is a bad dream."

and so on.

Currently I'm just using explode

$sentences = explode(".", $sourceWork);

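which splits only on periods, and then I append a period to the end of each piece. I know this is far from what I want, but I'm not really sure where to start on this. If someone could at least point me in the right direction for ideas, that would be amazing.

Thanks in advance!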


3 Answers

3

Here is what I have:

<?php

/**
 * @param string $str                          String to split
 * @param string $end_of_sentence_characters   Characters which represent the end of the sentence. Should be a string with no spaces (".,!?")
 *
 * @return array
 */
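function split_sentences($str, $end_of_sentence_characters) {
    // Escape the characters so they are safe inside a regex character class.
    $pattern = '/[' . preg_quote($end_of_sentence_characters, '/') . ']/';
    $inside_quotes = false;
    $buffer = "";
    $result = array();
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $buffer .= $str[$i];
        // A double quote toggles whether we are inside quoted dialogue.
        if ($str[$i] === '"') {
            $inside_quotes = !$inside_quotes;
        }
        // Only end a sentence on punctuation found outside quotes.
        if (!$inside_quotes && preg_match($pattern, $str[$i])) {
            $result[] = $buffer;
            $buffer = "";
        }
    }
    // Keep any trailing text that did not end with a sentence character.
    if ($buffer !== "") {
        $result[] = $buffer;
    }
    return $result;
}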

$str = <<<STR
One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don't know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.
STR;

var_dump(split_sentences($str, "."));
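
Tracking quotes alone never breaks right after a closing quote, so dialogue that ends a sentence runs into the text that follows it. One possible refinement, sketched below, treats a closing quote as a boundary when the quoted text ended in punctuation and the next word starts with a capital letter. This is only a heuristic: `split_sentences_quoted` is a hypothetical name, and the code assumes straight double quotes and single-byte text.

```php
<?php
/**
 * Heuristic sketch (an assumption, not a general solution): split on . ! ?
 * outside quotes, and also after a closing quote that follows . ! ? when
 * the next word starts with a capital letter.
 */
function split_sentences_quoted(string $text): array
{
    $sentences = [];
    $buffer = '';
    $insideQuotes = false;
    $length = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        $buffer .= $char;
        $boundary = false;

        if ($char === '"') {
            $insideQuotes = !$insideQuotes;
            // A closing quote ends the sentence only if the quoted text ended
            // with punctuation and a capitalised word (or nothing) follows.
            if (!$insideQuotes && $i > 0 && strpos('.!?', $text[$i - 1]) !== false) {
                $rest = ltrim(substr($text, $i + 1));
                $boundary = ($rest === '' || ctype_upper($rest[0]));
            }
        } elseif (!$insideQuotes && strpos('.!?', $char) !== false) {
            // Keep runs such as "!?" or "..." together by waiting for the last mark.
            $next = $i + 1 < $length ? $text[$i + 1] : '';
            $boundary = ($next === '' || strpos('.!?', $next) === false);
        }

        if ($boundary) {
            $sentences[] = trim($buffer);
            $buffer = '';
        }
    }

    // Keep any trailing text that did not end on a boundary.
    if (trim($buffer) !== '') {
        $sentences[] = trim($buffer);
    }
    return $sentences;
}

$text = 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don\'t know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.';

$result = split_sentences_quoted($text);
print_r($result);
```

On the sample paragraph this gives four sentences, with `"I... don't know." said Samsa, "Maybe this is a bad dream."` kept as one. The capital-letter check is a guess: `"Stop!" Gregor said.` would be split after the quote, and abbreviations such as `Mr.` would also trigger a break outside quotes.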
answered 2012-09-22T21:03:14.173
0
preg_split('/[.?!]/',$sourceWork);

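This is a very simple regular expression, but I think your task is impossible. Note that this pattern drops the punctuation it splits on and pays no attention to quotation marks.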
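
If the punctuation needs to stay with each sentence, one option is to split on the whitespace that follows it, using a lookbehind. The sketch below assumes sentences are separated by whitespace; the sample string is made up for illustration.

```php
<?php
// Hypothetical sample text, not taken from the question.
$sourceWork = 'It rained. Did it stop? Yes! Good.';

// Split on whitespace preceded by ., ? or !. The lookbehind is zero-width,
// so the punctuation stays attached to the sentence before it.
$sentences = preg_split('/(?<=[.?!])\s+/', $sourceWork, -1, PREG_SPLIT_NO_EMPTY);

print_r($sentences);
```

This still ignores quotes, so it would break `"I... don't know."` right after the ellipsis.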

answered 2012-09-22T20:48:29.077
0

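You need to walk through the string manually and do the splitting yourself. Keep a count of the quotation marks, and do not break while the count is odd. Here is a simple idea: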

<?php
//$str = 'AAA. BBB. "CCC." DDD. EEE. "FFF. GGG. HHH".';
$str = 'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin. "What has happened!?" he asked himself. "I... don\'t know." said Samsa, "Maybe this is a bad dream." He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.';
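$last_dot = 0;   // start offset of the sentence being collected
$quotation = 0;  // number of double quotes seen so far
$explode_list = array();
$length = strlen($str);
for ($i = 0; $i < $length; $i++)
{
    $char = $str[$i]; // get the current character
    if ($char == '"') $quotation++; // track quotation marks
    if ($quotation % 2 == 1) continue; // odd count: inside a quote, so do not split

    if ($char == '.')
    {
        // Cut from the previous split point up to and including this period.
        $explode_list[] = substr($str, $last_dot, $i + 1 - $last_dot);
        $last_dot = $i + 1;
    }
}
// Keep any text left after the last period, such as text ending in a closing quote.
if ($last_dot < $length) {
    $explode_list[] = substr($str, $last_dot);
}

echo "testing:<pre>";
print_r($explode_list);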
answered 2012-09-22T20:57:40.047