php - Regex Issue w/ Single Line XML

Question

I'm creating a Word document via XML, and the last step in the process is removing any blank lines. I have a pattern that appears to work when the XML is multiline; however, the XML is being generated as a single line, which breaks my preg_replace. Consider the following XML:

**<w:p** w:rsidR="009E48E3" w:rsidRPr="008C0DAB" w:rsidRDefault="009E48E3" w:rsidP="004E0AE3"><w:pPr><w:ind w:right="-540"/></w:pPr><w:r w:rsidRPr="008C0DAB">**<w:t>text that should be included</w:t>**</w:r>**</w:p><w:p** w:rsidR="009E48E3" w:rsidRPr="008C0DAB" w:rsidRDefault="009E48E3" w:rsidP="004E0AE3"><w:pPr><w:numPr><w:ilvl w:val="1"/> <w:numId w:val="10"/></w:numPr><w:tabs><w:tab w:val="clear" w:pos="1440"/><w:tab w:val="num" w:pos="1080"/></w:tabs><w:ind w:right="-540" w:hanging="720"/><w:rPr><w:noProof/></w:rPr></w:pPr><w:r><w:rPr><w:noProof/></w:rPr><w:lastRenderedPageBreak/>**<w:t> ; </w:t>**</w:r>**</w:p>**

I inserted the asterisks only to help readability; they are not part of the XML.

Blank lines are always between <w:t></w:t> tags and contain a period or semicolon. Therefore, the first <w:p> element should remain while the second should be removed.

Here is my pattern: <w:p .*<w:t>[ ]+?(\.|;)[ ]+?<\/w:t>.*?<\/w:p>

Any help is appreciated, thank you!

score 1 · Accepted Answer

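The problem with your pattern is that the first .* reads straight to the end of the XML and then backtracks until it sits just before the last <w:t> tag. From there, the rest of the pattern successfully matches the rest of the XML. The result: the entire XML is captured!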
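To see the over-matching concretely, here is a minimal sketch. The shortened XML and its `a="…"` attributes are placeholders rather than the real document, and the `/` delimiters are added only so the question's pattern can be passed to `preg_match`.

```php
<?php
// Simplified stand-in for the question's XML (asterisks and most tags removed).
$xml = '<w:p a="1"><w:r><w:t>text that should be included</w:t></w:r></w:p>'
     . '<w:p a="2"><w:r><w:t> ; </w:t></w:r></w:p>';

// The pattern from the question, wrapped in / delimiters.
$pattern = '/<w:p .*<w:t>[ ]+?(\.|;)[ ]+?<\/w:t>.*?<\/w:p>/';

preg_match($pattern, $xml, $m);

// The match starts at the first <w:p and ends at the final </w:p>,
// so replacing it wipes out both paragraphs.
var_dump($m[0] === $xml);                   // bool(true)
var_dump(preg_replace($pattern, '', $xml)); // string(0) ""
```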

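The problem with Itchy's pattern is that the lookahead (?!.*w:p ) says "only if there are no more <w:p> tags ahead". In other words, the pattern will only ever match the last <w:p> element (and only if that one needs to be removed).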
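A small sketch of that limitation, using a made-up input in which the blank paragraph comes first and a real paragraph follows. The XML is a shortened placeholder, not the question's document, and the `/` delimiters are added for `preg_replace`.

```php
<?php
// Hypothetical input: the blank paragraph is NOT the last <w:p> element.
$xml = '<w:p a="1"><w:r><w:t> ; </w:t></w:r></w:p>'
     . '<w:p a="2"><w:r><w:t>text that should be included</w:t></w:r></w:p>';

// Itchy's pattern, wrapped in / delimiters.
$pattern = '/<w:p (?!.*w:p ).*?<w:t>[ ]+?(\.|;)[ ]+?<\/w:t>.*?<\/w:p>/';

$result = preg_replace($pattern, '', $xml);

// The lookahead rejects the first <w:p because another "w:p " follows it,
// and the last <w:p is not blank, so nothing is removed.
var_dump($result === $xml); // bool(true)
```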

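All of these problems stem from .*. My two-part motto is: try not to use it unless absolutely necessary. Then, if you find it absolutely necessary to use it, try not to use it :)

The following pattern will work: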

<w:p [^<]++(?:(?!<w:t>)<[^<]++)++<w:t> *+[\.;] *+<\/w:t>[^<]*+(?:(?!<\/w:p>)<[^<]++)++<\/w:p>

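Notes:

.* is not used at all!
The possessive quantifiers ++ and *+ are not strictly needed, but they speed up the regex.
The last part could be simplified to <\/w:t><\/w:r><\/w:p> if the elements always end that way.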
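As a rough illustration, the pattern above could be applied with `preg_replace` as follows. The XML is the sample from the question with the readability asterisks removed; the variable names and the `/` delimiters are assumptions made for this sketch.

```php
<?php
// First paragraph: contains real text and should be kept.
$keep = '<w:p w:rsidR="009E48E3" w:rsidRPr="008C0DAB" w:rsidRDefault="009E48E3" w:rsidP="004E0AE3">'
      . '<w:pPr><w:ind w:right="-540"/></w:pPr><w:r w:rsidRPr="008C0DAB">'
      . '<w:t>text that should be included</w:t></w:r></w:p>';

// Second paragraph: its <w:t> holds only " ; " and should be removed.
$blank = '<w:p w:rsidR="009E48E3" w:rsidRPr="008C0DAB" w:rsidRDefault="009E48E3" w:rsidP="004E0AE3">'
       . '<w:pPr><w:numPr><w:ilvl w:val="1"/> <w:numId w:val="10"/></w:numPr>'
       . '<w:tabs><w:tab w:val="clear" w:pos="1440"/><w:tab w:val="num" w:pos="1080"/></w:tabs>'
       . '<w:ind w:right="-540" w:hanging="720"/><w:rPr><w:noProof/></w:rPr></w:pPr>'
       . '<w:r><w:rPr><w:noProof/></w:rPr><w:lastRenderedPageBreak/><w:t> ; </w:t></w:r></w:p>';

$xml = $keep . $blank; // single-line XML, as generated

// The pattern from this answer, wrapped in / delimiters.
$pattern = '/<w:p [^<]++(?:(?!<w:t>)<[^<]++)++<w:t> *+[\.;] *+<\/w:t>'
         . '[^<]*+(?:(?!<\/w:p>)<[^<]++)++<\/w:p>/';

// $count receives the number of paragraphs that were removed.
$cleaned = preg_replace($pattern, '', $xml, -1, $count);

var_dump($cleaned === $keep); // bool(true)
var_dump($count);             // int(1)
```

Because each repeated group stops at the next `<w:t>` (or `</w:p>`), a match attempt that starts in the first paragraph cannot run on into the second one; it simply fails and the engine moves to the next `<w:p `.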

score 0 · Accepted Answer

For the string you provided, the following pattern works:

<w:p (?!.*w:p ).*?<w:t>[ ]+?(\.|;)[ ]+?<\/w:t>.*?<\/w:p>

I've tested it on Rubular.

It uses a negative lookahead.
