php - Counting words in PHP text with embedded HTML

Question

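I have some fairly large paragraphs (5000-6000 words) that contain text and embedded HTML tags. I want to break each large paragraph into chunks of 1500 words, ignoring the HTML markup; that is, the 1500 should count only actual words, not anything inside the tags. Using the strip_tags function I can count the number of words (ignoring the HTML markup), but I can't figure out how to break the text into 1500-word chunks that still include the HTML markup. For example: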

This is <b> a </b> paragraph which <a href="#"> has some </a> some text to be broken in <h1> 5 words </h1>.

The result should be:

1 = This is <b> a </b> paragraph which
2 = <a href="#"> has some </a> some text to
3 = be broken in <h1> 5 words </h1>.
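
For reference, the counting step mentioned above could look roughly like the sketch below. The function name countVisibleWords is made up for illustration, and the definition of a "word" (any whitespace-delimited token that starts with a letter or digit) is an assumption, chosen so that "5" counts and the lone trailing "." does not.

```php
<?php
// Hypothetical helper: count words while ignoring HTML markup.
// Assumption: a word is a whitespace-delimited token starting with a letter or digit.
function countVisibleWords(string $html): int
{
    // strip_tags() removes tags without inserting spaces, so
    // "<p>one</p><p>two</p>" becomes "onetwo" and is counted as one word.
    $text = strip_tags($html);
    return (int) preg_match_all('/[\p{L}\p{N}]\S*/u', $text);
}

$html = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
      . 'some text to be broken in <h1> 5 words </h1>.';
echo countVisibleWords($html), PHP_EOL; // 15 for this sample
```

PHP's built-in str_word_count() would also work on the stripped text, but it skips digits, so "5" would not be counted.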

score 2 · Accepted Answer

Think about using the explode() function wisely. Better, though longer: write a regular expression that matches either a word or a tag together with all the text inside it. You should treat the contents of an HTML element as an unbreakable entity. For example, you could write a function that breaks your large paragraph into the following array of entities:

$data = array(
  array( "count" => 2, "text" => "This is "),
  array( "count" => 1, "text" => "<b> a </b>"),
  array( "count" => 2, "text" => " paragraph which"),
  // ... and so on for the remaining entities
);

Then write a loop that builds the smaller paragraphs from the $data array.

Also, it will sometimes be impossible to make a paragraph exactly 1500 words long. It may come out longer or shorter, because you should not split your HTML tags.
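
A minimal sketch of the two steps described above (split into entities, then loop over them) might look like this. The function names are invented for illustration, and the regular expression is an assumption: it only handles simple elements that do not contain another element of the same name, and it is no substitute for a real HTML parser. Unlike the $data example, each plain word becomes its own entity here, so chunks can break between any two words.

```php
<?php
// Hypothetical helper: count words, ignoring markup. A word is assumed to be
// a whitespace-delimited token that starts with a letter or digit.
function countWords(string $html): int
{
    return (int) preg_match_all('/[\p{L}\p{N}]\S*/u', strip_tags($html));
}

// Split the input into unbreakable entities, each with its word count.
function splitIntoEntities(string $html): array
{
    // Alternatives, tried in order: a whole element (<tag>...</tag>),
    // a lone tag such as <br>, or a single word outside any tag.
    // Trailing whitespace stays attached to the entity it follows.
    $pattern = '~(?:<(\w+)\b[^>]*>.*?</\1\s*>|<[^>]*>|[^\s<]+)\s*~su';
    preg_match_all($pattern, $html, $matches);

    $entities = [];
    foreach ($matches[0] ?? [] as $piece) {
        $entities[] = ['count' => countWords($piece), 'text' => $piece];
    }
    return $entities;
}

// Group entities into chunks of at most $limit words. A single entity that
// is larger than $limit still ends up in a chunk of its own.
function chunkEntities(array $entities, int $limit): array
{
    $chunks = [];
    $current = '';
    $count = 0;
    foreach ($entities as $entity) {
        // Close the current chunk if this entity would push it over the limit.
        if ($count > 0 && $count + $entity['count'] > $limit) {
            $chunks[] = rtrim($current);
            $current = '';
            $count = 0;
        }
        $current .= $entity['text'];
        $count += $entity['count'];
    }
    if ($current !== '') {
        $chunks[] = rtrim($current);
    }
    return $chunks;
}

$html = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
      . 'some text to be broken in <h1> 5 words </h1>.';
foreach (chunkEntities(splitIntoEntities($html), 5) as $i => $chunk) {
    echo ($i + 1), ' = ', $chunk, PHP_EOL;
}
```

With the sample from the question and a limit of 5, this should give the three chunks the question asks for. For the real text, pass 1500 as the limit.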

score 1 · Accepted Answer

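If you want to guarantee valid markup, I think you will need to parse your HTML. In that case, this question should provide a very useful starting point.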

score 0 · Accepted Answer

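Use an XML DOM parser or an HTML DOM parser.

Iterate over all nodes
Count the words in each node
If the word count exceeds N
- create a new node of the parent's type
- insert it as a sibling after the parent
- move the current node and all following siblings into it.
Move on to the next element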
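
The steps above can be sketched with PHP's bundled DOM extension roughly as follows. This is a simplified, one-level variant under stated assumptions: it walks only the direct children of one container element and treats each child as unbreakable, so a single child larger than N still gets a container to itself. The names splitContainer and countWordsInText are made up for this example.

```php
<?php
// Hypothetical helper: a word is assumed to be a whitespace-delimited token
// that starts with a letter or digit.
function countWordsInText(string $text): int
{
    return (int) preg_match_all('/[\p{L}\p{N}]\S*/u', $text);
}

// Split $container into sibling containers of at most $limit words each.
// Returns the list of containers, the original one first.
function splitContainer(DOMElement $container, int $limit): array
{
    $parts = [$container];
    $count = 0;
    $node = $container->firstChild;
    while ($node !== null) {
        $words = countWordsInText($node->textContent);
        if ($count > 0 && $count + $words > $limit) {
            // Create a new node of the parent's type (a shallow clone, which
            // also copies attributes such as id) and insert it after the parent.
            $newContainer = $container->cloneNode(false);
            $container->parentNode->insertBefore($newContainer, $container->nextSibling);
            // Move the current node and all following siblings into it.
            while ($node !== null) {
                $next = $node->nextSibling;
                $newContainer->appendChild($node);
                $node = $next;
            }
            // Continue counting from the start of the new container.
            $container = $newContainer;
            $parts[] = $container;
            $count = 0;
            $node = $container->firstChild;
            continue;
        }
        $count += $words;
        $node = $node->nextSibling;
    }
    return $parts;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><p>one two three</p><p>four five</p><p>six seven eight</p><p>nine</p></div>');
$div = $doc->getElementsByTagName('div')->item(0);
foreach (splitContainer($div, 5) as $part) {
    echo $doc->saveHTML($part), PHP_EOL;
}
```

Because whole nodes are moved rather than strings being cut, every resulting container is well-formed markup. Splitting inside a long text node or inside nested elements would need a recursive version of the same idea.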
