regex - Regular expression to extract city, state, and country from HTML

Question

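I am using Outwit Hub to scrape a website for city, state, and country (US and Canada only). The program lets me use regular expressions to define the markers before and after the text I want to grab. I can also define a format for the desired text.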

Here is a sample of the HTML:

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

I have set up my regular expressions as follows:

CITY - Before (not formatted as a regex)

<td width="22%" nowrap="nowrap"><strong>

CITY - After (accounts for states, territories, and provinces)

/(,\s|\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b|\bUSA|\bCanada)/

STATE - Before

\<td width="22%" nowrap="nowrap"\>\<strong\>\s|,\s

STATE - After

/\bUSA\<\/strong\>\<\/td\>|\bCanada\<\/strong\>\<\/td\>/

STATE - Format

/\b[A-Z][A-Z]\b/

COUNTRY - Before (accounts for states, territories, and provinces)

/(\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b)\s/

COUNTRY - After (not formatted as a regex)

</strong></td><td width="10%" align="right" nowrap="nowrap">

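The problem arises when no city or state is listed. I have tried to account for this, but it only made things worse. Is there a way to clean this up and still allow for the possibility of missing information? Thank you.

Example with no city: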

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Example with no city or state (yes, there is an extra line break):

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>

USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Thanks for any help you can provide.

score 1 · Accepted Answer

If you have the pro version, you can do the following:

Description: Data
Before: <td width="22%" nowrap="nowrap"><strong>
After: </strong>
Format: (([\w \-]+),)? ?([A-Z]{2})?[\r\n](USA|canada)\s*
Replace: \2##\3##\4
Separator: ##
Labels: City,State,Country
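
For readers who want to try the optional-group idea outside Outwit Hub, here is a minimal Python sketch. It assumes the text between the Before and After markers has already been isolated. The names `PLACE` and `parse_place` are hypothetical, and the pattern is an adaptation rather than a copy of the Format field: `Canada` is capitalised on the assumption that the page spells it that way, and `[\r\n]+` is used so that the extra blank line is tolerated.

```python
import re

# Adapted from the Format field above. Assumptions: "Canada" is capitalised
# on the page, and one or more line breaks may precede the country.
PLACE = re.compile(r"(?:([\w \-]+),)? ?([A-Z]{2})?[\r\n]+(USA|Canada)\s*")


def parse_place(cell_text):
    """Return (city, state, country); a missing part comes back as None."""
    m = PLACE.search(cell_text)
    if m is None:
        return None
    return m.group(1), m.group(2), m.group(3)


# The three cell bodies from the question, as they appear between
# <strong> and </strong>.
samples = ["\nBILLINGS, MT\nUSA", "\nMT\nUSA", "\n\nUSA"]
for sample in samples:
    print(parse_place(sample))
```

Because the city and state groups are both optional, a single pattern covers all three cases; the caller only has to decide what to show in place of `None`.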

If you are using the light version, you have to do it in three lines:

Description: City
Before: <td width="22%" nowrap="nowrap"><strong>
After: ,
Format: [^<>]+

Description: State
Before: /<td width="22%" nowrap="nowrap"><strong>[\r\n]([^<>\r\n ]+,)?/
After: /[\r\n]/
Format: [A-Z]{2}

Description: Country
Before:
After: </strong></td>
Format: (USA|canada)

score 0 · Accepted Answer

Using TXR, a text-scraping and data-processing language:

@(collect)
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
@  (cases)
@city, @state
@  (or)

@    (bind (city state) ("n/a" "n/a"))
@  (or)
@state
@    (bind city "n/a")
@  (end)
@country</strong></td>
<td width="10%" align="right" nowrap="nowrap">
@(end)
@(output)
CITY       STATE       COUNTRY
@  (repeat)
@{city 10} @{state 11} @country
@  (end)
@(end)

The file city.html contains the three cases concatenated together. Run:

$ txr city.txr  city.html
CITY       STATE       COUNTRY
BILLINGS   MT          USA
n/a        MT          USA
n/a        n/a         USA
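
For comparison, the same line-by-line case analysis can be sketched in Python. This is only a rough equivalent of the TXR query, not a translation of it: the names `CELL_OPEN`, `COUNTRY_LINE`, and `scrape` are hypothetical, and it assumes each cell spans exactly three lines, as in the samples from the question.

```python
import re

# Opening line of the cell, as it appears in the sample HTML.
CELL_OPEN = '<td width="22%" nowrap="nowrap"><strong>'
# The country sits on the line that closes the cell.
COUNTRY_LINE = re.compile(r"^(.*)</strong></td>$")


def scrape(html):
    """Yield (city, state, country) per cell, with "n/a" for missing parts."""
    lines = [line.strip() for line in html.splitlines()]
    for i, line in enumerate(lines):
        if line != CELL_OPEN or i + 2 >= len(lines):
            continue
        place = lines[i + 1]
        m = COUNTRY_LINE.match(lines[i + 2])
        if m is None:
            continue
        if ", " in place:      # "CITY, ST"
            city, state = place.split(", ", 1)
        elif place == "":      # blank line: neither city nor state
            city, state = "n/a", "n/a"
        else:                  # state only
            city, state = "n/a", place
        yield city, state, m.group(1)


SAMPLE = """<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
<td width="8%" nowrap="nowrap"></td>
<td width="22%" nowrap="nowrap"><strong>

USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">
"""

print(f"{'CITY':<10} {'STATE':<11} COUNTRY")
for city, state, country in scrape(SAMPLE):
    print(f"{city:<10} {state:<11} {country}")
```

The three branches mirror the `cases` / `or` clauses in the query: a comma-separated line, a blank line, and a line holding only the state.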

Another example of HTML scraping with TXR: Extract text from HTML Table
