I'm working on a WordPress site where one page lists excerpts about corporate clients.
Say I have a web page where the visible text looks like this:
"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an enhanced version of its Enterprise Messaging
Service (CMS) 2.0, a lower cost webmail alternative to other business
email solutions such as Microsoft Exchange, GroupWise and LotusNotes
offerings."
But suppose the text may contain HTML links or images, so the raw HTML might look like this:
<img src="/images/corporate/logos/super_amazing.jpg" alt="Company
logo for SuperAmazing.com" /> SuperAmazing.com, a subsidiary of
<a href="http://www.amazing.com/">Amazing</a>, the leading
provider of integrated messaging and collaboration services, today
announced the availability of an enhanced version of its Enterprise
Messaging Service (CMS) 2.0, a lower cost webmail alternative to other
business email solutions such as Microsoft Exchange, GroupWise and
LotusNotes offerings."
Here's what I need to do: find out whether there is a link within the first 20 visible words.
These are the first 20 visible words:
"SuperAmazing.com, a subsidiary of Amazing, the leading provider of
integrated messaging and collaboration services, today announced the
availability of an"
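I need the character count (including HTML) up to and including the 20th visible word, which in this case is "an"; of course it will differ for each excerpt on the page.
(If it makes things easier, I'm fine with counting "SuperAmazing.com" as 2 words.)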
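To make the target concrete, here is a rough sketch of the count I mean, written in Python rather than WordPress/PHP and as a small tokenizing loop rather than the single regex I'm asking about. The names `offset_after_visible_words` and `has_link_before` are made up for illustration. It assumes simple markup: no `>` inside attribute values, no comments or scripts, and entities such as `&amp;` treated as ordinary characters. A tag does not end a word, so `Amazing</a>,` still counts as one word.

```python
import re

# One token per match: an HTML tag, a run of whitespace, or a run of visible text.
TOKEN = re.compile(r"(?P<tag><[^>]*>)|(?P<ws>\s+)|(?P<text>[^<\s]+)")


def offset_after_visible_words(html, limit=20):
    """Return the raw-HTML offset just past the `limit`-th visible word.

    If the text has fewer than `limit` visible words, return len(html).
    """
    words = 0
    in_word = False
    end = 0  # offset just past the last visible character seen so far
    for m in TOKEN.finditer(html):
        kind = m.lastgroup
        if kind == "tag":
            continue  # tags are invisible and do not split a word
        if kind == "ws":
            in_word = False  # whitespace ends the current word
            continue
        if not in_word:
            if words == limit:
                return end  # a new word would start: stop at the previous one
            words += 1
            in_word = True
        end = m.end()
    return end if words == limit else len(html)


def has_link_before(html, limit=20):
    """True if an <a ...> tag opens before the end of the `limit`-th visible word."""
    cutoff = offset_after_visible_words(html, limit)
    return re.search(r"<a\b", html[:cutoff], re.IGNORECASE) is not None


snippet = '<a href="http://www.amazing.com/">Amazing</a>, the leading provider'
cutoff = offset_after_visible_words(snippet, limit=2)
print(cutoff, has_link_before(snippet, limit=2))
```

The returned offset is the character count including HTML, so slicing the raw HTML at that offset gives everything up to the last visible word counted.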
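I've tried a few regular expressions to count words, but they all count the HTML rather than just the visible words.
So, what is the correct regex to find the full character count (including HTML) of the first 20 visible words?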