My experience tells me that I should not use RegExp to parse HTML/XML, and I completely agree! It is:

  • messy
  • not robust and easily broken
  • pure evil

Everyone says something along the lines of "use a DOM parser", which is fine by me. But now I am curious. How do those work?

I have been searching for the source of the DOMDocument class, but I cannot find it.

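This question comes from the fact that filter_var(), for example, is considered a good alternative to validating email addresses with RegExp, yet when you look at its source you find that it actually uses RegExp itself!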

So, if you were to build a DOM parser in PHP, how would you go about parsing the HTML? How did they do it?

2 Answers

I think you should check out the article How Browsers Work: Behind the Scenes of Modern Web Browsers. It's a lengthy read, but well worth your time. Specifically, the HTML Parser section.

While I cannot do the article justice, perhaps a cursory summary will tide you over until you have the time to read and digest that masterpiece. I must admit, though, that I am a novice in this area with very little experience. Despite having developed for the web professionally for about 10 years, the way in which the browser handles and interprets my code has long been a black box to me.

HTML, XHTML, CSS or JavaScript: take your pick. They all have a grammar as well as a vocabulary. English is another great example. We have grammatical rules that we expect people, books, and more to follow. We also have a vocabulary made up of nouns, verbs, adjectives and more.

Browsers interpret a document by examining its grammar as well as its vocabulary. When they come across items they ultimately don't understand, they let you know (raising exceptions, etc.). You and I do the same in everyday speech.

I love StackOverflow, but if I could changed one thing it would be be absolutamente broken...

Note in the example above how you immediately start to pick apart the words and the relationships between them. The beginning makes complete sense: "I love StackOverflow." Then we come to "...if I could changed," and we immediately stop. "Changed" doesn't belong here. It's likely the author meant "change" instead. Here the vocabulary is right, but the grammar is wrong. A little later we come across "be be", which may also violate a grammatical rule, and just a bit further we encounter the word "absolutamente", which is not part of the English vocabulary: another mistake.
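
A parser reads markup in much the same two steps: first split the input into tokens, then check those tokens against the rules. As a rough illustration only, here is a minimal sketch of the first step, written in Python for brevity. The `tokenize` function and its token names are hypothetical, and it deliberately ignores attributes, comments, entities and self-closing tags, so it is nowhere near what a browser or libxml actually does.

```python
def tokenize(markup):
    """Split markup into ('open', name), ('close', name) and ('text', data) tokens.

    Simplified on purpose: no attributes, comments, entities or
    self-closing tags, and an unterminated tag at the end is dropped.
    """
    tokens = []
    state = "text"   # the scanner is either in "text" or inside a "tag"
    buffer = []      # characters collected for the token being read

    for ch in markup:
        if state == "text":
            if ch == "<":
                # A tag starts: emit any text collected so far.
                if buffer:
                    tokens.append(("text", "".join(buffer)))
                buffer = []
                state = "tag"
            else:
                buffer.append(ch)
        else:  # state == "tag"
            if ch == ">":
                # The tag ends: decide whether it opens or closes an element.
                name = "".join(buffer).strip()
                if name.startswith("/"):
                    tokens.append(("close", name[1:].strip().lower()))
                else:
                    tokens.append(("open", name.lower()))
                buffer = []
                state = "text"
            else:
                buffer.append(ch)

    # Emit trailing text, if any.
    if state == "text" and buffer:
        tokens.append(("text", "".join(buffer)))
    return tokens


print(tokenize("<h1>Hello World</h1>"))
```

The point of the sketch is the state variable: the scanner walks the input one character at a time and only needs to remember whether it is currently inside a tag, which is why no regular expression is involved.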

Think of all of this in terms of a DOCTYPE. Right now I have the source of the XHTML 1.0 Strict DOCTYPE open on my second monitor. Among its internals are lines like the following:

<!ENTITY % heading "h1|h2|h3|h4|h5|h6">

This defines the heading entities. As long as I adhere to the grammar of XHTML, I can use any one of these in my document (<h1>Hello World</h1>). But if I try to make one up, say h7, a validating parser will stumble over the vocabulary as "foreign" and inform me:

"Line 7, Column 8: element "h7" undefined"

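A vocabulary check of this kind can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's standard `html.parser` module for tokenizing, `KNOWN_ELEMENTS` is a tiny hand-picked subset rather than the full XHTML 1.0 Strict element list, and `VocabularyChecker` and `check_vocabulary` are hypothetical names. The message format merely imitates the error quoted above.

```python
from html.parser import HTMLParser

# A small illustrative vocabulary, NOT the complete XHTML 1.0 Strict list.
KNOWN_ELEMENTS = {"html", "head", "title", "body", "p",
                  "h1", "h2", "h3", "h4", "h5", "h6"}


class VocabularyChecker(HTMLParser):
    """Collects an error for every start tag outside KNOWN_ELEMENTS."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag not in KNOWN_ELEMENTS:
            # getpos() returns (line, offset); the offset is zero-based.
            line, column = self.getpos()
            self.errors.append(
                f'Line {line}, Column {column}: element "{tag}" undefined'
            )


def check_vocabulary(markup):
    """Return a list of error messages for unknown elements in markup."""
    checker = VocabularyChecker()
    checker.feed(markup)
    checker.close()
    return checker.errors


print(check_vocabulary("<h1>Hello World</h1>"))  # no unknown elements
print(check_vocabulary("<body><h7>Oops</h7></body>"))
```
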
Perhaps while parsing the document we come across <table. We know that we're now dealing with a table element, which has its own set of vocabulary such as tbody, tr, etc. As long as we know the language, the grammar rules, etc., we know when something is wrong. Returning to the XHTML 1.0 Strict Doctype, we find the following:

<!ELEMENT table
     (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))>
<!ELEMENT caption  %Inline;>
<!ELEMENT thead    (tr)+>
<!ELEMENT tfoot    (tr)+>
<!ELEMENT tbody    (tr)+>
<!ELEMENT colgroup (col)*>
<!ELEMENT col      EMPTY>
<!ELEMENT tr       (th|td)+>
<!ELEMENT th       %Flow;>
<!ELEMENT td       %Flow;>

Given this reference, we can keep a running check against whatever source we're parsing. If the author writes tread, instead of thead, we have a standard by which we can determine that to be in error. When issues are unresolved, and we cannot find rules to match certain uses of grammar and vocabulary, we inform the author that their document is invalid.
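
A DTD content model is itself a regular expression over the names of an element's direct children, so the "running check" can be sketched quite literally. The following is a minimal illustration, not how libxml or any browser implements validation: the `CONTENT_MODELS` table is a hand translation of the excerpt above, `children_are_valid` is a hypothetical helper, and the `%Inline;` and `%Flow;` models for caption, th and td are left out.

```python
import re

# Content models from the DTD excerpt, hand-translated into regular
# expressions. Each child name is followed by one space, so "col " can
# never be confused with the start of "colgroup ".
CONTENT_MODELS = {
    "table": r"(caption )?((col )*|(colgroup )*)(thead )?(tfoot )?((tbody )+|(tr )+)",
    "thead": r"(tr )+",
    "tfoot": r"(tr )+",
    "tbody": r"(tr )+",
    "colgroup": r"(col )*",
    "col": r"",                 # EMPTY: no children allowed
    "tr": r"((th|td) )+",
}


def children_are_valid(element, child_names):
    """Return True if the ordered child names satisfy the element's model."""
    model = CONTENT_MODELS[element]
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(model, sequence) is not None


print(children_are_valid("table", ["caption", "thead", "tbody"]))
print(children_are_valid("table", ["tread", "tbody"]))
print(children_are_valid("table", ["tbody", "thead"]))
```

The first call satisfies the model, the second fails on the unknown name tread, and the third fails because thead must come before tbody. Note that the regular expression here runs over a list of already tokenized element names, not over the raw markup.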

I am by no means doing this science justice. However, I hope that this serves, if nothing more, as enough encouragement for you to sit down and read the article referenced at the beginning of this answer, and perhaps to study the various DTDs that we encounter day to day.

answered 2012-05-05T18:25:58.113

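Good news: you don't need to reinvent the wheel. The libxml library is used by PHP's DOMDocument extension, and its source code is available. I suggest taking a look there.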

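By the way, regular expressions are not always wrong, but you need to use them properly. Otherwise you will go straight to hell's kitchen, become a serial kitten killer, or get a visit from Cthulhu (or whatever that guy is called). So I suggest reading the following: REX: XML Shallow Parsing with Regular Expressions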

If you do everything right, regular expressions can help you a lot with parsing. You just need to know what you are doing.

answered 2012-05-05T17:14:52.320