html - Regex extract html source with multiple elements

Question

Before you tell me not to use regex to parse HTML: I'm aware of this, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with using regex to achieve my goal.

What I need is to take the following example HTML and extract each line:

  <b>Item 1</b> Text <br>
  <b>Item 2</b> Text <br>
  <b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>

What I need is an expression for each item that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4, but I cannot work out how to get items 1 - 3, as what I require is just the line from <b> to <br>.
/<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.

Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to regex, so no need to go down that route again :)

This is all quite new to me, so I hope I've explained what I'm trying to do.

Thanks in advance

score 1 · Accepted Answer

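I have used regular expressions to parse HTML before, and it worked fine. I used something like the pattern below. As you can see, there are a lot of ".*?" sequences, which mean a non-greedy match of any character. Very useful.

What language are you using? You may have to set an option that lets the match span newlines; otherwise each line may be treated as a separate input.

In Python, add the re.DOTALL option. In PHP, there is a modifier you can place after the closing slash (the s modifier, PCRE_DOTALL).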

<b>(.*?)<br>.*?<b>(.*?)<br>.*?<b>(.*?)<br>.*?<p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
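
Since the question is about JavaScript, here is a rough sketch of how a pattern like the one above might look there. It is an adaptation, not a tested Data Extractor script: it swaps ".*?" for "[\s\S]*?" so that matches can cross newlines without relying on the ES2018 "s" flag (whether Data Extractor supports that flag is not known), and it puts a "[\s\S]*?" gap between every pair of pieces to absorb the line breaks and indentation in the sample. The variable names are made up for illustration.

```javascript
// Sample input copied from the question.
const html = [
  '  <b>Item 1</b> Text <br>',
  '  <b>Item 2</b> Text <br>',
  '  <b>Item 3</b> Text <br>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>',
  '<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>',
].join('\n');

// [\s\S]*? is a non-greedy "any character, newlines included".
// It plays the role that .*? plays under Python's re.DOTALL.
const pattern = /<b>([\s\S]*?)<br>[\s\S]*?<b>([\s\S]*?)<br>[\s\S]*?<b>([\s\S]*?)<br>[\s\S]*?<p[\s\S]*?sans-serif"><b>([\s\S]*?)<\/p>[\s\S]*?serif">([\s\S]*?)<\/p>/i;

const match = html.match(pattern);

// Groups 1 to 5 hold the captured pieces; match is null if nothing matched.
const captures = match ? match.slice(1) : [];
console.log(captures);
```

Each capture still contains the inner tags (for example the closing </b>), because the groups take everything between the outer markers.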

score 0 · Accepted Answer

To use it with Data Extractor, I did some research on getting the data between two keywords, and /(Item 1:.*?<br>)/gi works well.
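
For anyone who wants all the item lines in one pass rather than one expression per item, a minimal sketch along the same "between two keywords" idea is shown here. It assumes every item starts with <b>Item followed by a number and ends at the next <br>; the colon is made optional because the sample in the question has none on items 1 - 3. The function name extractItemLines is hypothetical.

```javascript
// Sample lines copied from the question.
const html = [
  '  <b>Item 1</b> Text <br>',
  '  <b>Item 2</b> Text <br>',
  '  <b>Item 3</b> Text <br>',
].join('\n');

// Return every "<b>Item N ... <br>" run, tags included.
function extractItemLines(source) {
  // \d+ matches the item number, :? allows an optional colon,
  // [\s\S]*? lazily consumes everything up to the first <br>.
  const pattern = /<b>Item \d+:?[\s\S]*?<br>/gi;
  return source.match(pattern) || [];
}

console.log(extractItemLines(html));
// [ '<b>Item 1</b> Text <br>', '<b>Item 2</b> Text <br>', '<b>Item 3</b> Text <br>' ]
```

With the g flag, String.prototype.match returns the full matches only, which suits the "exactly how it appears in the html" requirement.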

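Unfortunately, I have now been told that the tags have to be stripped out from now on, so I will need to scratch my head over that one. If I need help, I'll post a new question.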

Thank you very much for your replies and help.
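
On the tag stripping mentioned above, one possible starting point is a replace with a tag-shaped pattern. This is only a sketch under the assumption that no attribute value contains a ">" character, which holds for the sample but not for HTML in general. The helper names stripFontTags and stripAllTags are hypothetical.

```javascript
// Remove only <font ...> and </font>, keeping every other tag.
function stripFontTags(line) {
  return line.replace(/<\/?font\b[^>]*>/gi, '');
}

// Remove every tag, then collapse leftover whitespace.
function stripAllTags(line) {
  return line
    .replace(/<[^>]+>/g, '') // drop anything that looks like <...>
    .replace(/\s+/g, ' ')    // squeeze runs of whitespace to one space
    .trim();
}

const item4 = '<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>';

console.log(stripFontTags(item4)); // <p><b>Item 4:</b></p>
console.log(stripAllTags(item4));  // Item 4:
console.log(stripAllTags('<b>Item 1</b> Text <br>')); // Item 1 Text
```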
