4

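I have some fairly large paragraphs (5000-6000 words) that contain text with embedded HTML tags. I want to split such a paragraph into chunks of 1500 words, ignoring the HTML markup: the 1500 should count only actual words, not anything inside the tags. Using the strip_tags function I can count the words (ignoring the HTML tags), but I can't figure out how to break the paragraph into 1500-word chunks that still include the HTML tags. For example (using chunks of 5 words):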

This is <b> a </b> paragraph which <a href="#"> has some </a> some text to be broken in <h1> 5 words </h1>.

The result should be

1 = This is <b> a </b> paragraph which
2 = <a href="#"> has some </a> some text to
3 = be broken in <h1> 5 words </h1>. 
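For reference, the counting step mentioned in the question can be sketched as follows. The helper name `countVisibleWords` is made up for illustration, and treating a "word" as any whitespace-delimited run that contains at least one letter or digit is an assumption chosen so that "5" counts as a word and the lone "." does not.

```php
<?php
// Hypothetical helper: counts the words a reader would see, ignoring markup.
// Assumption: a "word" is a run of non-space characters containing at least
// one letter or digit, so "5" counts and a stray "." does not.
function countVisibleWords(string $html): int
{
    return (int) preg_match_all('/\S*[\p{L}\p{N}]\S*/u', strip_tags($html));
}

$paragraph = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
           . 'some text to be broken in <h1> 5 words </h1>.';

$total = countVisibleWords($paragraph);
echo $total, PHP_EOL;
```

This only gives the total; it says nothing about where the tags sit, which is the hard part of the question.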

3 Answers

2

Think about using the explode() function wisely. Or better, though longer: use a regular expression that matches either a word or a tag together with all the text inside it. You should treat the elements inside HTML tags as unbreakable entities. For example, you can write a function that breaks your large paragraph into the following array of entities:

$data = array(
  array( "count" => 2, "text" => "This is "),
  array( "count" => 1, "text" => "<b> a </b>"),
  array( "count" => 2, "text" => " paragraph which"),
  // ... and so on for the remaining entities
);
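A function of that kind might look like the sketch below. It is only one possible reading of the idea: the names `countWords` and `splitIntoEntities` are invented here, plain text is split word by word (rather than grouped as in the array above) so that a chunk can end after any word, and the regular expression does not cope with nested tags of the same name or with a stray `<` in the text.

```php
<?php
// Hypothetical helper: words are non-space runs containing a letter or digit.
function countWords(string $fragment): int
{
    return (int) preg_match_all('/\S*[\p{L}\p{N}]\S*/u', strip_tags($fragment));
}

// Hypothetical helper: breaks a paragraph into unbreakable entities.
// Alternatives of the pattern, in order:
//   1. an element with its content, e.g. <b> a </b>   (kept as one entity)
//   2. a lone tag such as <br>                        (zero words)
//   3. a single plain-text word
// Each entity keeps its trailing whitespace so the pieces can be re-joined.
function splitIntoEntities(string $html): array
{
    $pattern = '~<(\w+)\b[^>]*>.*?</\1>\s*|<[^>]*>\s*|[^\s<]+\s*~su';
    preg_match_all($pattern, $html, $matches);

    $entities = [];
    foreach ($matches[0] as $text) {
        $entities[] = ['count' => countWords($text), 'text' => $text];
    }
    return $entities;
}

$paragraph = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
           . 'some text to be broken in <h1> 5 words </h1>.';
$entities = splitIntoEntities($paragraph);
```

Joining the `text` fields back together should give the original paragraph, which is a handy sanity check before chunking.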

Then you should write a loop that builds the smaller paragraphs from the $data array.

Also, sometimes it won't be possible to make a paragraph exactly 1500 words long. It can come out longer or shorter, because you should not split your HTML tags.
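The loop could be sketched like this. The function name `buildChunks` is hypothetical, and the `$data` array is hand-written here with one entry per plain word (an assumption, so that the loop can stop after any word) and one entry per tagged element. A chunk is closed as soon as the next entity would push it over the limit, so chunks may come out shorter than the limit, and a single element larger than the limit gets a chunk of its own.

```php
<?php
// Hypothetical helper: groups entities into chunks of at most $limit words,
// never splitting an entity.
function buildChunks(array $entities, int $limit): array
{
    $chunks = [];
    $current = '';
    $wordsInChunk = 0;

    foreach ($entities as $entity) {
        // Close the current chunk if this entity would overflow it.
        if ($wordsInChunk > 0 && $wordsInChunk + $entity['count'] > $limit) {
            $chunks[] = trim($current);
            $current = '';
            $wordsInChunk = 0;
        }
        $current .= $entity['text'];
        $wordsInChunk += $entity['count'];
    }
    if (trim($current) !== '') {
        $chunks[] = trim($current);
    }
    return $chunks;
}

// The example paragraph from the question, already broken into entities.
$data = [
    ['count' => 1, 'text' => 'This '],
    ['count' => 1, 'text' => 'is '],
    ['count' => 1, 'text' => '<b> a </b> '],
    ['count' => 1, 'text' => 'paragraph '],
    ['count' => 1, 'text' => 'which '],
    ['count' => 2, 'text' => '<a href="#"> has some </a> '],
    ['count' => 1, 'text' => 'some '],
    ['count' => 1, 'text' => 'text '],
    ['count' => 1, 'text' => 'to '],
    ['count' => 1, 'text' => 'be '],
    ['count' => 1, 'text' => 'broken '],
    ['count' => 1, 'text' => 'in '],
    ['count' => 2, 'text' => '<h1> 5 words </h1>'],
    ['count' => 0, 'text' => '.'],
];

$chunks = buildChunks($data, 5);
print_r($chunks);
```

With a limit of 5 this should reproduce the three chunks asked for in the question; for real input you would pass 1500.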

answered 2012-12-18T14:59:15.197
1

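If you want to guarantee valid markup, I think you will need to parse your HTML. In that case, this question should provide a very useful starting point.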
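As a rough illustration of what parsing buys you, the sketch below loads the fragment with PHP's bundled DOMDocument and counts words per text node, so the tags never have to be stripped by hand. The wrapper `<div>` and the definition of a "word" (a non-space run containing a letter or digit) are assumptions of this example, and a word that straddles a tag boundary would be counted twice.

```php
<?php
$html = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
      . 'some text to be broken in <h1> 5 words </h1>.';

libxml_use_internal_errors(true);            // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML('<div>' . $html . '</div>');  // wrapper div: assumption of this sketch
$root = $doc->getElementsByTagName('div')->item(0);

// Walk every text node under the wrapper and add up its words.
$xpath = new DOMXPath($doc);
$total = 0;
foreach ($xpath->query('.//text()', $root) as $textNode) {
    $total += (int) preg_match_all('/\S*[\p{L}\p{N}]\S*/u', $textNode->data);
}
echo $total, PHP_EOL;
```

Once the text lives in nodes like this, a split can be made at a node boundary and the surrounding elements re-closed properly, which plain string cutting cannot guarantee.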

answered 2012-12-18T16:34:29.333
0

Use an XML DOM Parser or an HTML DOM Parser

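  • Traverse all the nodes
  • Count the words in each node
  • If the word count exceeds N
    • Create a new node of the parent's type
    • Insert it as a sibling after the parent
    • Move the current node and all following siblings into it.
  • Move on to the next element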
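The steps above can be sketched with PHP's DOMDocument roughly as follows. This is one possible interpretation, not a tested library: the names `findCutPoint` and `splitHtmlByWords`, the wrapper `<div>`, and the word definition are all assumptions, and text nodes are cut in two where the limit falls in the middle of one. When the cut lands at the very start of an element, the whole element moves to the next chunk instead of leaving an empty copy behind.

```php
<?php
const WORD_RE = '/\S*[\p{L}\p{N}]\S*/u'; // a "word": non-space run with a letter or digit

// Returns the text node that should start the next chunk, or null if
// everything left under $root fits within $limit words.
function findCutPoint(DOMElement $root, int $limit): ?DOMText
{
    $doc = $root->ownerDocument;
    $seen = 0;
    foreach ((new DOMXPath($doc))->query('.//text()', $root) as $text) {
        $n = (int) preg_match_all(WORD_RE, $text->data, $m, PREG_OFFSET_CAPTURE);
        if ($seen + $n <= $limit) {
            $seen += $n;
            continue;
        }
        $index = $limit - $seen;            // first word that no longer fits
        if ($index === 0) {
            return $text;                   // cut in front of the whole text node
        }
        $offset = $m[0][$index][1];         // byte offset of that word
        $tail = $doc->createTextNode(substr($text->data, $offset));
        $text->data = substr($text->data, 0, $offset);
        $text->parentNode->insertBefore($tail, $text->nextSibling);
        return $tail;
    }
    return null;
}

function splitHtmlByWords(string $html, int $limit): array
{
    if ($limit < 1) {
        throw new InvalidArgumentException('limit must be at least 1');
    }
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML('<div>' . $html . '</div>');   // wrapper div: assumption
    $root = $doc->getElementsByTagName('div')->item(0);

    $chunks = [];
    while (($node = findCutPoint($root, $limit)) !== null) {
        // Split every ancestor below the wrapper at the cut point.
        while (!$node->parentNode->isSameNode($root)) {
            $parent = $node->parentNode;
            if ($node->previousSibling === null) { // nothing before the cut:
                $node = $parent;                   // move the cut in front of the parent
                continue;
            }
            $clone = $parent->cloneNode(false);    // same tag and attributes, no children
            $parent->parentNode->insertBefore($clone, $parent->nextSibling);
            for ($move = $node; $move !== null; $move = $next) {
                $next = $move->nextSibling;
                $clone->appendChild($move);        // current node and following siblings
            }
            $node = $clone;
        }
        // Everything in front of the cut forms one finished chunk.
        $chunk = '';
        while (!$root->firstChild->isSameNode($node)) {
            $chunk .= $doc->saveHTML($root->firstChild);
            $root->removeChild($root->firstChild);
        }
        $chunks[] = trim($chunk);
    }
    $rest = '';
    foreach ($root->childNodes as $child) {
        $rest .= $doc->saveHTML($child);
    }
    if (trim($rest) !== '') {
        $chunks[] = trim($rest);
    }
    return $chunks;
}

$chunks = splitHtmlByWords(
    'This is <b> a </b> paragraph which <a href="#"> has some </a> '
    . 'some text to be broken in <h1> 5 words </h1>.',
    5
);
print_r($chunks);
```

Unlike the entity approach, this one can cut inside a long element and still emit balanced tags on both sides, at the cost of depending on how the parser normalises the input markup.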
answered 2012-12-18T16:54:03.493