10

这个问题几乎与如何将结构化文本文件转换为 PHP 多维数组重复,但我再次发布它,因为我无法理解给出的基于正则表达式的解决方案。尝试使用 PHP 来解决这个问题似乎更好,这样我就可以真正从中学习(此时正则表达式太难理解)。

假设以下文本文件:

HD Alcoa Earnings Soar; Outlook Stays Upbeat 
BY By James R. Hagerty and Matthew Day 
PD 12 July 2011
LP 

Alcoa Inc.'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts' forecasts.

However, profits wereless than expected

TD
Licence this article via our website:

http://example.com

我用 PHP 阅读了这个文本文件,需要一种将文件内容放入数组的可靠方法,如下所示:

array(
  [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat,
  [BY] => By James R. Hagerty and Matthew Day,
  [PD] => 12 July 2011,
  [LP] => Alcoa Inc.'s profit...than expected,
  [TD] => Licence this article via our website: http://example.com
)

这些词HD BY PD LP TD是标识文件中新部分的键。在数组中,所有换行符都可以从值中删除。理想情况下,我可以在没有正则表达式的情况下做到这一点。我相信在所有键上爆炸可能是一种方法,但它会很脏:

$fields = array('HD', 'BY', 'PD', 'LP', 'TD');
$parts = explode($text, "\nHD ");
$HD = $parts[0];

是否有人对如何循环遍历文本有更清晰的想法,甚至可能一次,并将其划分为上面给出的数组?

4

9 回答 9

14

这是另一种更短的方法,不使用正则表达式。

/**
 * @param  array  array of stopwords eq: array('HD', 'BY', ...)
 * @param  string Text to search in
 * @param  string End Of Line symbol
 * @return array  [ stopword => string, ... ]
 */
function extract_parts(array $parts, $str, $eol=PHP_EOL) {
  $ret=array_fill_keys($parts, '');
  $current=null;
  foreach(explode($eol, $str) AS $line) {
    $substr = substr($line, 0, 2);
    if (isset($ret[$substr])) {
      $current = $substr;
      $line = trim(substr($line, 2));
    }
    if ($current) $ret[$current] .= $line;
  }
  return $ret;
}

$ret = extract_parts(array('HD', 'BY', 'PD', 'LP', 'TD'), $str);
var_dump($ret);

为什么不使用正则表达式?

由于 php 文档,特别是 preg_* 函数,建议在非强烈要求时不要使用正则表达式。我想知道这个问题的答案中的哪些示例具有最佳性能。

结果令我惊讶:

Answer 1 by: hek2mgl     2.698 seconds (regexp)
Answer 2 by: Emo Mosley  2.38  seconds
Answer 3 by: anubhava    3.131 seconds (regexp)
Answer 4 by: jgb         1.448 seconds

我原以为正则表达式变体会是最快的。

好吧,无论如何不使用正则表达式并不是一件坏事。换句话说:使用正则表达式通常不是最好的解决方案。您必须根据具体情况决定最佳解决方案。

您可以使用此脚本重复测量。


编辑

这是一个使用正则表达式模式的简短、更优化的示例。仍然不如我上面的示例快,但比其他基于正则表达式的示例快。

可以优化输出格式(空格/换行符)。

function extract_parts_regexp($str) {
  $a=array();
  preg_match_all('/(?<k>[A-Z]{2})(?<v>.*?)(?=\n[A-Z]{2}|$)/Ds', $str, $a);
  return array_combine($a['k'], $a['v']);
}
于 2013-08-23T19:47:07.287 回答
8

代表简化、快速和可读的正则表达式代码的请求!

(来自评论中的 Pr0no)您认为您可以简化正则表达式或对如何开始使用 php 解决方案有提示吗?是的,Pr0n0,我相信我可以简化正则表达式。

我想证明正则表达式是迄今为止最好的工作工具,并且它不必像我们之前看到的那样是可怕且难以理解的表达式。我已将此功能分解为可理解的部分。

我已经避免了复杂的正则表达式功能,例如捕获组和通配符表达式,而是专注于尝试生成一些简单的东西,让你在 3 个月后回到这里会感到很舒服。

我提议的功能(已评论)

function headerSplit($input) {

    // First, let's put our headers (any two consecutive uppercase characters at the start of a line) in an array
    preg_match_all(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        $matches              /* And store them in a $matches array            */
    );

    // Next, let's split our string into an array, breaking on those headers
    $split = preg_split(
        "/^[A-Z]{2}/m",       /* Find 2 uppercase letters at start of a line   */
        $input,               /* In the '$input' string                        */
        null,                 /* No maximum limit of matches                   */
        PREG_SPLIT_NO_EMPTY   /* Don't give us an empty first element          */
    );

    // Finally, put our values into a new associative array
    $result = array();
    foreach($matches[0] as $key => $value) {
        $result[$value] = str_replace(
            "\r\n",              /* Search for a new line character            */
            " ",                 /* And replace with a space                   */
            trim($split[$key])   /* After trimming the string                  */
        );
    }

    return $result;
}

和输出(注意:根据您的操作系统,您可能需要\r\n\ninstr_replace函数替换) :

array(5) {
  ["HD"]=> string(41) "Alcoa Earnings Soar; Outlook Stays Upbeat"
  ["BY"]=> string(35) "By James R. Hagerty and Matthew Day"
  ["PD"]=> string(12) "12 July 2011"
  ["LP"]=> string(172) "Alcoa Inc.'s profit more than doubled in the second quarter.  The giant aluminum producer managed to meet analysts' forecasts.    However, profits wereless than expected"
  ["TD"]=> string(59) "Licence this article via our website:    http://example.com"
}

删除更清洁函数的注释

此函数的精简版。它与上面完全相同,但删除了注释:

function headerSplit($input) {
    preg_match_all("/^[A-Z]{2}/m",$input,$matches);
    $split = preg_split("/^[A-Z]{2}/m",$input,null,PREG_SPLIT_NO_EMPTY);
    $result = array();
    foreach($matches[0] as $key => $value) $result[$value] = str_replace("\r\n"," ",trim($split[$key]));
    return $result;
}

从理论上讲,您在实时代码中使用哪一个并不重要,因为解析注释对性能影响很小,因此请使用您更熟悉的那个。

此处使用的正则表达式的细分

函数中只有一个表达式(尽管使用了两次),为简单起见,我们将其分解:

"/^[A-Z]{2}/m"

/     - This is a delimiter, representing the start of the pattern.
^     - This means 'Match at the beginning of the text'.
[A-Z] - This means match any uppercase character.
{2}   - This means match exactly two of the previous character (so exactly two uppercase characters).
/     - This is the second delimiter, meaning the pattern is over.
m     - This is 'multi-line mode', telling regex to treat each line as a new string.

这个微小的表达式足够强大,可以匹配HD但不是HDM在行首,也不HD是(例如 in Full HD)在行的中间。使用非正则表达式选项将无法轻松实现这一点。

如果您想要两个或更多(而不是正好 2 个)连续的大写字符来表示一个新部分,请使用/^[A-Z]{2,}/m.

使用预定义标题列表

阅读了您的最后一个问题以及您在@jgb 帖子下的评论后,您似乎想要使用预定义的标题列表。您可以通过将我们的正则表达式替换为"/^(HD|BY|WC|PD|SN|SC|PG|LA|CY|LP|TD|CO|IN|NS|RE|IPC|PUB|AN)/m-- 在正则表达式中|被视为“或”来做到这一点。

基准测试 - 可读并不意味着慢

不知何故,基准测试已成为对话的一部分,尽管我认为它没有为您提供可读和可维护的解决方案,但我重写了 JGB 的基准测试以向您展示一些东西

这是我的结果,表明这个基于正则表达式的代码是这里最快的选择(这些结果基于 5,000 次迭代):

SWEETIE BELLE'S SOLUTION (2 UPPERCASE IS A HEADER):         0.054 seconds
SWEETIE BELLE'S SOLUTION (2+ UPPERCASE IS A HEADER):        0.057 seconds
MATEWKA'S SOLUTION (MODIFIED, 2 UPPERCASE IS A HEADER):     0.069 seconds
BABA'S SOLUTION (2 UPPERCASE IS A HEADER):                  0.075 seconds
SWEETIE BELLE'S SOLUTION (USES DEFINED LIST OF HEADERS):    0.086 seconds
JGB'S SOLUTION (USES DEFINED LIST OF HEADERS, MODIFIED):    0.107 seconds

以及输出格式不正确的解决方案的基准:

MATEWKA'S SOLUTION:                                         0.056 seconds
JGB'S SOLUTION:                                             0.061 seconds
HEK2MGL'S SOLUTION:                                         0.106 seconds
ANUBHAVA'S SOLUTION:                                        0.167 seconds

我提供 JGB 函数的修改版本的原因是因为他的原始函数在将段落添加到输出数组之前不会删除换行符。小字符串操作会对性能产生巨大影响,必须平等地进行基准测试才能获得公平的性能估计。

此外,使用 jgb 的功能,如果您传入完整的标题列表,您将在数组中获得一堆空值,因为它似乎不会在分配之前检查密钥是否存在。如果您想稍后循环这些值,这将导致另一个性能下降,因为您必须先检查empty

于 2013-08-30T10:35:38.347 回答
6

这是一个没有正则表达式的简单解决方案

$data = explode("\n", $str);
$output = array();
$key = null;

foreach($data as $text) {
    $newKey = substr($text, 0, 2);
    if (ctype_upper($newKey)) {
        $key = $newKey;
        $text = substr($text, 2);
    }
    $text = trim($text);
    isset($output[$key]) ? $output[$key] .= $text : $output[$key] = $text;
}
print_r($output);

输出

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)

观看现场演示

笔记

您可能还想执行以下操作:

  • 检查重复数据
  • 确保只HD|BY|PD|LP|TD使用
  • 删除$text = trim($text)以便新行将保留在文本中
于 2013-08-25T20:47:13.213 回答
5

如果每个文件只有一条记录,请执行以下操作:

$record = array();
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];
        $record[$currentKey] = $matches[2];
    } else {
        $record[$currentKey] .= str_replace("\n", ' ', $line);
    }   
}

代码遍历输入的每一行并检查该行是否以标识符开头。如果是,currentKey则设置为此标识符。删除新行后,除非找到新标识符,否则所有后续内容都将添加到数组中的此键中。

var_dump($record);

输出:

array(5) {
  'HD' =>
  string(42) "Alcoa Earnings Soar; Outlook Stays Upbeat "
  'BY' =>
  string(36) "By James R. Hagerty and Matthew Day "
  'PD' =>
  string(12) "12 July 2011"
  'LP' =>
  string(169) " Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected  "
  'TD' =>
  string(58) "Licence this article via our website:  http://example.com "
}

注意:如果每个文件有多个记录,您可以优化解析器以返回多维数组:

$records = array();
foreach(file('input.txt') as $line) {
    if(preg_match('~^(HD|BY|PD|LP|TD) ?(.*)?$~', $line, $matches)) {
        $currentKey = $matches[1];

        // start a new record if `HD` was found.
        if($currentKey === 'HD') {
            if(is_array($record)) {
                $records []= $record;
            }
            $record = array();
        }
        $record[$currentKey] = $matches[2];
    } else {
        $record[$currentKey] .= str_replace("\n", ' ', $line);
    }   
}

然而,数据格式本身对我来说看起来很脆弱。如果 LP 看起来像这样:

LP dfks ldsfjksdjlf
lkdsjflk dsfjksld..
HD defsdf sdf sd....

你看,在我的例子中,LP 的数据中有一个 HD。为了保持数据可解析,您必须避免这种情况。

于 2013-08-23T15:45:10.500 回答
5

更新 :

鉴于发布的示例输入文件和代码,我改变了我的答案。我添加了 OP 提供的“部分”,这些“部分”定义了部分代码并使函数能够处理 2 位或更多位代码。下面是一个非正则表达式程序函数,它应该产生预期的结果:

# Parses the given text file and populates an array with coded sections.
# INPUT:
#   filename = (string) path and filename to text file to parse
# RETURNS: (assoc array)
#   null is returned if there was a file error or no data was found
#   otherwise an associated array of the field sections is returned
function getSections($parts, $lines) {
   $sections = array();
   $code = "";
   $str = "";
   # examine each line to build section array
   for($i=0; $i<sizeof($lines); $i++) {
      $line = trim($lines[$i]);
      # check for special field codes
      $words = explode(' ', $line, 2);
      $left = $words[0];
      #echo "DEBUG: left[$left]\n";
      if(in_array($left, $parts)) {
         # field code detected; first, finish previous section, if exists
         if($code) {
            # store the previous section
            $sections[$code] = trim($str);
         }
         # begin to process new section
         $code = $left;
         $str = trim(substr($line, strlen($code)));
      } else if($code && $line) {
         # keep a running string of section content
         $str .= " ".$line;
      }
   } # for i
   # check for no data
   if(!$code)
      return(null);
   # store the last section and return results
   $sections[$code] = trim($str);
   return($sections);
} # getSections()


$parts = array('HD', 'BY', 'WC', 'PD', 'SN', 'SC', 'PG', 'LA', 'CY', 'LP', 'TD', 'CO', 'IN', 'NS', 'RE', 'IPC', 'PUB', 'AN');

$datafile = $argv[1]; # NOTE: I happen to be testing this from command-line
# load file as array of lines
$lines = file($datafile);
if($lines === false)
   die("ERROR: unable to open file ".$datafile."\n");
$data = getSections($parts, $lines);
echo "Results from ".$datafile.":\n";
if($data)
   print_r($data);
else
   echo "ERROR: no data detected in ".$datafile."\n";

结果:

Array
(   
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
    [BY] => By James R. Hagerty and Matthew Day
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts. However, profits wereless than expected
    [TD] => Licence this article via our website: http://example.com
)
于 2013-08-23T15:47:26.933 回答
3

考虑到解析输入输出数据的规则,这是我认为使用正则表达式不应该成为问题的一个问题。考虑这样的代码:

$s = file_get_contents('input'); // read input file into a string
$match = array(); // will hold final output
if (preg_match_all('~(^|[A-Z]{2})\s(.*?)(?=[A-Z]{2}\s|$)~s', $s, $arr)) {
    for ( $i = 0; $i < count($arr[1]); $i++ )
       $match[ trim($arr[1][$i]) ] = str_replace( "\n", "", $arr[2][$i] );
}
print_r($match);

正如您所看到的,由于preg_match_all使用了匹配来自输入文件的数据的方式,代码变得紧凑。

输出:

Array
(
    [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat 
    [BY] => By James R. Hagerty and Matthew Day 
    [PD] => 12 July 2011
    [LP] => Alcoa Inc.'s profit more than doubled in the second quarter.The giant aluminum producer managed to meet analysts' forecasts.However, profits wereless than expected
    [TD] => Licence this article via our website:http://example.com
)
于 2013-08-23T16:21:23.443 回答
2

根本不要循环。这个怎么样(假设每个文件一条记录)?

$inrec = file_get_contents('input');
$inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", $inrec ) ) )."'";
eval( '$record = array('.$inrec.');' );
var_export($record);

结果:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat ',
  'BY' => 'By James R. Hagerty and Matthew Day ',
  'PD' => '12 July 2011',
  'LP' => ' 

Alcoa Inc.\'s profit more than doubled in the second quarter.
The giant aluminum producer managed to meet analysts\' forecasts.

However, profits wereless than expected
',
  'TD' => '
Licence this article via our website:

http://example.com',
)

如果每个文件的记录多于记录,请尝试以下操作:

$inrecs = explode( 'HD ', file_get_contents('input') );
$records = array();
foreach ( $inrecs as $inrec ) {
   $inrec = str_replace( "\n'", "'", str_replace( array( 'HD ', 'BY ', 'PD ', 'LP', 'TD' ), array( "'HD' => '", "','BY' => '", "','PD' => '", "','LP' => '", "','TD' => '" ), str_replace( "'", "\\'", 'HD ' . $inrec ) ) )."'";
   eval( '$records[] = array('.$inrec.');' );
}
var_export($records);

编辑

这是一个将 $inrec 函数分开的版本,因此可以更容易理解 - 并进行了一些调整:去除换行符,修剪前导和尾随空格,并解决 EVAL 中的反斜杠问题,以防数据来自不受信任资源。

$inrec = file_get_contents('input');
$inrec = str_replace( '\\', '\\\\', $inrec );       // Preceed all backslashes with backslashes
$inrec = str_replace( "'", "\\'", $inrec );         // Precede all single quotes with backslashes
$inrec = str_replace( PHP_EOL, " ", $inrec );       // Replace all new lines with spaces
$inrec = str_replace( array( 'HD ', 'BY ', 'PD ', 'LP ', 'TD ' ), array( "'HD' => trim('", "'),'BY' => trim('", "'),'PD' => trim('", "'),'LP' => trim('", "'),'TD' => trim('" ), $inrec )."')";
eval( '$record = array('.$inrec.');' );
var_export($record);

结果:

array (
  'HD' => 'Alcoa Earnings Soar; Outlook Stays Upbeat',
  'BY' => 'By James R. Hagerty and Matthew Day',
  'PD' => '12 July 2011',
  'LP' => 'Alcoa Inc.\'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts\' forecasts.  However, profits wereless than expected',
  'TD' => 'Licence this article via our website:  http://example.com',
)
于 2013-08-24T23:08:36.317 回答
1

我准备了自己的解决方案,它比jgb 的回答更快。这是代码:

function answer_5(array $parts, $str) {
    $result = array_fill_keys($parts, '');
    $poss = $result;
    foreach($poss as $key => &$val) {
        $val = strpos($str, "\n" . $key) + 2;
    }

    arsort($poss);

    foreach($poss as $key => $pos) {
        $result[$key] = trim(substr($str, $pos+1));
        $str = substr($str, 0, $pos-1);
    }
    return str_replace("\n", "", $result);
}

这是性能的比较:

Answer 1 by: hek2mgl    2.791 seconds (regexp) 
Answer 2 by: Emo Mosley 2.553 seconds 
Answer 3 by: anubhava   3.087 seconds (regexp) 
Answer 4 by: jgb        1.53  seconds 
Answer 5 by: matewka    1.403 seconds

测试环境与 jgb 相同(100000 次迭代 - 从这里借用的脚本)。

享受并请留下评论。

于 2013-08-30T14:16:09.440 回答
1

更新

我突然意识到,在多记录场景中,在记录循环之外构建 $repl 会表现得更好。这是 2 字节关键字版本:

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$split = chr(255);
$repl = explode( ',', $split . implode( ','.$split, $keys ) );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, $repl, 'HD '.$inrec );

function parseRecord( $keys, $repl, $rec ) {
    $split = chr(255);
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

基准(感谢@jgb):

Answer 1 by: hek2mgl     6.783 seconds (regexp)
Answer 2 by: Emo Mosley  4.738 seconds
Answer 3 by: anubhava    6.299 seconds (regexp)
Answer 4 by: jgb         2.47 seconds
Answer 5 by: gwc         3.589 seconds (eval)
Answer 6 by: gwc         1.871 seconds

这是多个输入记录的另一个答案(假设每个记录以“HD”开头)并支持 2 字节、2 或 3 字节或可变长度关键字。

$inrecs = file_get_contents('input');
$inrecs = str_replace( PHP_EOL, " ", $inrecs );
$keys  = array( 'HD', 'BY', 'PD', 'LP', 'TD' );
$inrecs = explode( 'HD ', $inrecs );
array_shift( $inrecs );
$records = array();
foreach( $inrecs as $inrec ) $records[] = parseRecord( $keys, 'HD '.$inrec );

使用 2 字节关键字解析记录:

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ substr( $line, 0, 2 ) ] = trim( substr( $line, 3 ) );
    return $out;
}

使用 2 或 3 字节关键字解析记录(假设键和内容之间有空格或 PHP_EOL):

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) $out[ trim( substr( $line, 0, 3 ) ) ] = trim( substr( $line, 3 ) );
    return $out;
}

使用可变长度关键字解析记录(假设键和内容之间有空格或 PHP_EOL):

function parseRecord( $keys, $rec ) {
    $split = chr(255);
    $repl = explode( ',', $split . implode( ','.$split, $keys ) );
    $lines = explode( $split, str_replace( $keys, $repl, $rec ) );
    array_shift( $lines );
    $out = array();
    foreach ( $lines as $line ) {
        $keylen = strpos( $line.' ', ' ' );
        $out[ trim( substr( $line, 0, $keylen ) ) ] = trim( substr( $line, $keylen+1 ) );
    }
    return $out;
}

期望上面的每个 parseRecord 函数的性能都会比它的前任差一点。

结果:

Array
(
    [0] => Array
        (
            [HD] => Alcoa Earnings Soar; Outlook Stays Upbeat
            [BY] => By James R. Hagerty and Matthew Day
            [PD] => 12 July 2011
            [LP] => Alcoa Inc.'s profit more than doubled in the second quarter. The giant aluminum producer managed to meet analysts' forecasts.  However, profits wereless than expected
            [TD] => Licence this article via our website:  http://example.com
        )

)
于 2013-08-29T19:40:01.633 回答