php - What is the correct way to extract values using a regular expression

Question

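PS: I can't use DOM etc. on this code, because XPath doesn't work on this HTML: it comes from a badly maintained site and contains a lot of errors. Otherwise that would have been the easiest way for me.

I have the following HTML fragment, taken from the broken HTML: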

<td width="11%">Train Number</Td>
<td width="16%">Train Name</td>
<td width="18%">Boarding Date <br>(DD-MM-YYYY)</td>

<td width="7%">From</Td>
<td width="7%">To</Td>
<td width="14%">Reserved Upto</Td>
<td width="21%">Boarding Point</Td>
<td width="6%">Class</Td>
</TR>
<TR>
<TD class="table_border_both">*12018</TD>
<TD class="table_border_both">DEHRADUN SHTBDI</TD>
<TD class="table_border_both"> 9- 9-2012</TD>

<TD class="table_border_both">DDN </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">DDN </TD>
<TD class="table_border_both"> CC</TD>

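I want to get the values in the last 8 TDs using a regular expression. However, when I put the fragment in a heredoc, it doesn't match. How should I write the heredoc so that this pattern matches as is?

This is what I'm trying: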

 $trainpattern = <<<EOT
<td width="11%">Train Number</Td>
<td width="16%">Train Name</td>
<td width="18%">Boarding Date <br>[(]DD-MM-YYYY[)]</td>

<td width="7%">From</Td>
<td width="7%">To</Td>
<td width="14%">Reserved Upto</Td>
<td width="21%">Boarding Point</Td>
<td width="6%">Class</Td>
</TR>
<TR>
<TD class="table_border_both">[*]12018</TD>
<TD class="table_border_both">DEHRADUN SHTBDI</TD>
<TD class="table_border_both"> 9- 9-2012</TD>

<TD class="table_border_both">DDN </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">DDN </TD>
<TD class="table_border_both"> CC</TD>
EOT;


$ret = preg_match("#$trainpattern#s",$filetext,$matches);

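Also, when I take just the first two lines and join them into a single line with \s+, it matches, but I'm looking for a way to match the lines without joining them. Perhaps in that case I need to replace the \n and \r characters with \s*.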
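
The idea in the last paragraph, replacing the line breaks with a whitespace pattern, could be sketched roughly as follows. This is only an illustration under assumptions: `buildLoosePattern` is a hypothetical helper name, it uses `\s+` (the form reported above to match) rather than `\s*`, and it escapes metacharacters with `preg_quote` instead of hand-written classes such as `[(]` and `[*]`. Differing line endings (`\r\n` in the file versus `\n` in the heredoc) are one possible reason the literal heredoc fails, but that is a guess.

```php
<?php
// Hypothetical helper (not from the original post): turns a literal HTML
// snippet into a regex that accepts any run of whitespace wherever the
// snippet itself contains whitespace.
function buildLoosePattern(string $literal): string
{
    // Split the snippet on whitespace runs, dropping empty pieces.
    $parts = preg_split('/\s+/', trim($literal), -1, PREG_SPLIT_NO_EMPTY);

    // Escape regex metacharacters (and the '#' delimiter) in each piece.
    $quoted = array_map(fn($p) => preg_quote($p, '#'), $parts);

    // Rejoin with \s+; the 'i' flag treats <Td>, <TD> and <td> alike.
    return '#' . implode('\s+', $quoted) . '#i';
}

// The snippet uses "\n"; the page text uses "\r\n" plus indentation.
$snippet = "<td width=\"11%\">Train Number</Td>\n<td width=\"16%\">Train Name</td>";
$html    = "<td width=\"11%\">Train Number</Td>\r\n   <td width=\"16%\">Train Name</td>";

var_dump(preg_match(buildLoosePattern($snippet), $html));
```

A pattern built this way only checks that the fragment is present; to pull out the cell values, capture groups would still be needed.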

score 2 · Accepted Answer

To extract the values, you can use something like this:

<?php

// Note: <TR> and </TR> tags were added so that each row can be matched

$trainpattern = <<< EOT
<TR>
<td width="11%">Train Number</Td>
<td width="16%">Train Name</td>
<td width="18%">Boarding Date <br>(DD-MM-YYYY)</td>

<td width="7%">From</Td>
<td width="7%">To</Td>
<td width="14%">Reserved Upto</Td>
<td width="21%">Boarding Point</Td>
<td width="6%">Class</Td>
</TR>

<TR>
<TD class="table_border_both">[*]12018</TD>
<TD class="table_border_both">DEHRADUN SHTBDI</TD>
<TD class="table_border_both"> 9- 9-2012</TD>

<TD class="table_border_both">DDN </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">RK  </TD>
<TD class="table_border_both">DDN </TD>
<TD class="table_border_both"> CC</TD>
</TR>
EOT;

// $trs will contain each TR
$trs=array();
preg_match_all("|<tr>(.+)</tr>|Uis", $trainpattern, $trs);

// $keys will contain the TD values of the first TR
preg_match_all("|<td.*>(.+)</td>|Uis", $trs[1][0], $keys);

// $values will contain the TD values of the second TR
preg_match_all("|<td.*>(.+)</td>|Uis", $trs[1][1], $values);

// Join the keys and values
$results = array();
foreach ($keys[1] as $index => $key) {
    if (isset($values[1][$index])) {
       $results[$key] = $values[1][$index];
    }
}

var_dump($results);

This will show you:

array(8) {
  ["Train Number"]=>
  string(8) "[*]12018"
  ["Train Name"]=>
  string(15) "DEHRADUN SHTBDI"
  ["Boarding Date <br>(DD-MM-YYYY)"]=>
  string(10) " 9- 9-2012"
  ["From"]=>
  string(4) "DDN "
  ["To"]=>
  string(4) "RK  "
  ["Reserved Upto"]=>
  string(4) "RK  "
  ["Boarding Point"]=>
  string(4) "DDN "
  ["Class"]=>
  string(3) " CC"
}
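
If only the last 8 cells are needed, without the header keys, a shorter variant might look like the sketch below. It is not part of the answer above: `lastCellValues` is a hypothetical helper name, the sample `$html` is shortened to two cells per row, and trimming the values is an assumption about what the caller wants.

```php
<?php
// Hypothetical helper (not part of the answer above): returns the trimmed
// contents of the last $count <td> cells found in $html.
function lastCellValues(string $html, int $count = 8): array
{
    // [^>]* keeps the attribute match inside one tag; 'i' handles <TD>/<Td>,
    // 's' lets '.' span newlines, and '.*?' stops at the nearest </td>.
    preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $html, $m);

    // Keep only the last $count cells and strip surrounding whitespace.
    return array_map('trim', array_slice($m[1], -$count));
}

// Shortened sample: one header row and one data row, two cells each.
$html = '<TR><td width="11%">Train Number</Td><td width="6%">Class</Td></TR>'
      . '<TR><TD class="table_border_both">*12018</TD>'
      . '<TD class="table_border_both"> CC</TD></TR>';

print_r(lastCellValues($html, 2));
```

Using `[^>]*` instead of `.*` for the attributes keeps the match inside a single opening tag even without the `U` modifier.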

score 1 · Accepted Answer

Have you tried phpQuery? If you've ever used jQuery, you'll have no trouble with it.

Example:

require 'phpQuery.php';
phpQuery::newDocumentHTML($trainpattern);
foreach (pq('td')->slice(-8) as $v) {
    $v = pq($v);
    var_dump((string)$v);
    var_dump((string)$v->attr('class'));
    # etc...
}

Output:

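string(43) "<td class="table_border_both">[*]12018</td>"
string(50) "<td class="table_border_both">DEHRADUN SHTBDI</td>"
string(45) "<td class="table_border_both"> 9- 9-2012</td>"
string(39) "<td class="table_border_both">DDN </td>"
string(39) "<td class="table_border_both">RK  </td>"
string(39) "<td class="table_border_both">RK  </td>"
string(39) "<td class="table_border_both">DDN </td>"
string(38) "<td class="table_border_both"> CC</td>"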
