html - Removing HTML tags associated with a class

Question

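I'm forcing myself to learn how to script purely in AppleScript, but I've run into a problem trying to remove a particular tag that carries a class. I've looked for solid documentation and examples, but very little seems to be available.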

Here is the HTML I have:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

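What I'm trying to do is remove the tags for one specific class (here the span tags with class "foo") while keeping their text, so the result is: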

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

I know how to do this with do shell script and through Terminal, but I want to learn what is available through the AppleScript dictionary.

While researching, I found a way to strip all HTML tags:

on removeMarkupFromText(theText)
    set tagDetected to false
    set theCleanText to ""
    repeat with a from 1 to length of theText
        set theCurrentCharacter to character a of theText
        if theCurrentCharacter is "<" then
            set tagDetected to true
        else if theCurrentCharacter is ">" then
            set tagDetected to false
        else if tagDetected is false then
            set theCleanText to theCleanText & theCurrentCharacter as string
        end if
    end repeat
    return theCleanText
end removeMarkupFromText

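But this removes all HTML tags, which is not what I want. Searching SO, I found "Parsing HTML source code using AppleScript", which shows how to extract the text between tags, but I'm not looking to parse the file.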

I'm familiar with BBEdit's Balance Tags, listed as Balance in the drop-down menu, but when I run:

tell application "BBEdit"
    activate
    find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
    balance tags
end tell

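it gets greedy and grabs the entire run from the first tag to the second-to-last closing tag, including the text in between, instead of limiting itself to the text of the first tag.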

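Researching further under tag in the dictionary, I came across find tag, which lets me do: set spanTarget to (find tag "span" start_offset counter), then target the tag by class with |class| of attributes of tag of spanTarget and use balance tags, but I still run into the same problem as before.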

So, in pure AppleScript, how can I remove the tags associated with a class without the match being greedy?

score 1 · Accepted Answer

You can use a regular expression in the find command of BBEdit or TextWrangler.

To select the tag (non-greedy), use this command:

find "<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match

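About the pattern <span class=\"foo\">.+?</span>:

. matches any character (except a line break)
+ means one or more repetitions of the preceding element
? makes the quantifier non-greedy

So the pattern matches an opening span tag, followed by one or more characters other than a line break, followed by a closing span tag. The non-greedy quantifier gives the result we want: it stops BBEdit from running past the first closing tag and matching across several tags.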
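To make the greedy versus non-greedy difference concrete outside BBEdit, here is a minimal sketch that runs both forms of the pattern over a short string. It swaps in Foundation's regular-expression search through the AppleScriptObjC bridge rather than BBEdit's find command; the handler name deleteMatches and the sample string are illustrative assumptions, not part of the answer.

```applescript
use AppleScript version "2.4" -- Yosemite or later
use framework "Foundation"
use scripting additions

-- Hypothetical helper: delete every regex match of thePattern from theText
on deleteMatches(thePattern, theText)
	set theNSString to current application's NSString's stringWithString:theText
	set theRange to {location:0, |length|:theNSString's |length|()}
	set theResult to theNSString's stringByReplacingOccurrencesOfString:thePattern withString:"" options:(current application's NSRegularExpressionSearch) range:theRange
	return theResult as text
end deleteMatches

set theSample to "x <span class=\"foo\">a</span> y <span class=\"foo\">b</span> z"

-- Greedy: .+ runs to the LAST </span>, so a single match also swallows " y "
set greedyResult to deleteMatches("<span class=\"foo\">.+</span>", theSample)

-- Non-greedy: .+? stops at the FIRST </span>, so each span is matched on its own
set lazyResult to deleteMatches("<span class=\"foo\">.+?</span>", theSample)

return {greedyResult, lazyResult}
```

With the greedy pattern the text between the two spans disappears together with them; with the non-greedy pattern only the two spans are removed and " y " survives.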

To match the pattern across line breaks, just put (?s) at the beginning of the pattern, like this:

find "(?s)<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match

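This command matches a tag with no line break:

<span class="foo">shoulder</span>

Or a tag that contains a line break:

<span class="foo">shoulder
</span>

Or a tag that spans several lines:

<span class="foo">shoulder
xxxx
yyyy
zzzz
</span>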

In AppleScript, you can use the replace command (BBEdit or TextWrangler) to find the pattern and delete every matching string, like this:

replace "(?s)<span class=\"foo\">.+?</span>" using "" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
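Note that the replace command above deletes the enclosed text along with the tags. To get the output shown in the question, where the text stays and only the tags go, one option is a capture group. The sketch below assumes BBEdit's grep replacement syntax, in which \1 refers to the first parenthesized group (written \\1 inside an AppleScript string); the document reference and options are the same as in the command above.

```applescript
tell application "BBEdit"
	-- (.+?) captures the text between the tags; \1 in the replacement puts it back,
	-- so only the <span class="foo"> and </span> wrappers are removed
	replace "(?s)<span class=\"foo\">(.+?)</span>" using "\\1" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
end tell
```

Because the class name is part of the pattern, spans with other classes, such as class "bar", are not matched.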

score 0 · Accepted Answer

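I believe Ron's answer is a good approach, but if you don't want to use regular expressions, you can do it with the code below. After seeing Ron's answer I wasn't going to post this, but I had already written it, so I figured I would at least give you a second option, since you're trying to learn.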

on run
    set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class=\"foo\">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class=\"bar\">Pig brisket</span> jowl ham pastrami <span class=\"foo\">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>" 
    set theHTML to removeTag(theHTML, "<span class=\"foo\">", "</span>")
end run

on removeTag(theText, startTag, endTag)
    if theText contains startTag then
        -- Split the text at every occurrence of the start tag
        set AppleScript's text item delimiters to startTag
        set tempText to text items of (theText as string)
        set AppleScript's text item delimiters to {""}

        -- Item 2 is the text that follows the first start tag
        set middleText to item 2 of tempText as string
        if middleText contains endTag then
            -- Drop only the first end tag; implode restores any later ones
            set AppleScript's text item delimiters to endTag
            set tempText2 to text items of (middleText as string)
            set AppleScript's text item delimiters to {""}
            set newString to implode(tempText2, endTag)
            set item 2 of tempText to newString
        end if
        -- Rejoin without the first start tag, then handle the remaining occurrences
        set newString to implode(tempText, startTag)
        return removeTag(newString, startTag, endTag) -- recursive
    else
        return theText
    end if
end removeTag

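on implode(parts, tag)
    -- Join the first two parts with nothing between them (this drops one tag);
    -- any remaining parts are rejoined below with the tag put back
    set newString to items 1 thru 2 of parts as string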
    if (count of parts) > 2 then
        set newList to {newString, items 3 thru -1 of parts}
        set AppleScript's text item delimiters to tag
        set newString to (newList as string)
        set AppleScript's text item delimiters to {""}
    end if
    return newString
end implode

score 0 · Accepted Answer

This is a job for regular expressions, which are available through the now-supported AppleScriptObjC bridge. Paste this code into Script Editor and run it:

use AppleScript version "2.5" -- for El Capitan or later
use framework "Foundation"
use scripting additions

on stringByMatching:thePattern inString:theString replacingWith:theTemplate
    set theNSString to current application's NSString's stringWithString:theString
    -- Pattern options: let . match line breaks and let ^ and $ match at each line
    set theOptions to (current application's NSRegularExpressionDotMatchesLineSeparators as integer) + (current application's NSRegularExpressionAnchorsMatchLines as integer)
    set theExpression to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
    -- The options: here are matching options, a different set from the pattern options above, so pass 0
    set theResult to theExpression's stringByReplacingMatchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()} withTemplate:theTemplate
    return theResult as text
end stringByMatching:inString:replacingWith:

set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class='foo'>SHOULDER</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class='bar'>PIG BRISKET</span> jowl ham pastrami <span class='foo'>JERKY</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>"

set modifiedHTML to its stringByMatching:"<span .*?>(.*?)</span>" inString:theHTML replacingWith:"$1"

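This works with well-formed HTML, but as user foo pointed out above, browsers can cope with malformed HTML, whereas you may not be able to. Note that the pattern as written unwraps every span, whatever its class; to target a single class, the class attribute has to be part of the pattern.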
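As a hedged sketch of restricting the match to one class, the script below puts the class attribute into the pattern and accepts either quote style. It uses NSString's regular-expression search option instead of the handler above, and the pattern and the shortened sample string are assumptions for illustration only.

```applescript
use AppleScript version "2.4" -- Yosemite or later
use framework "Foundation"
use scripting additions

set theHTML to "a <span class=\"foo\">b</span> c <span class=\"bar\">d</span> e <span class='foo'>f</span>"

-- ([\"']) captures the opening quote and \1 requires the same closing quote;
-- (.*?) captures the span's text non-greedily; (?s) lets . match line breaks
set thePattern to "(?s)<span\\s+class=([\"'])foo\\1\\s*>(.*?)</span>"

set theNSString to current application's NSString's stringWithString:theHTML
set theRange to {location:0, |length|:theNSString's |length|()}

-- $2 in the template is the second capture group, i.e. the text inside the span
set modifiedHTML to (theNSString's stringByReplacingOccurrencesOfString:thePattern withString:"$2" options:(current application's NSRegularExpressionSearch) range:theRange) as text

return modifiedHTML
```

Only the spans whose class attribute is exactly foo are unwrapped; the class "bar" span is left as it is. A span with extra attributes or several class names would need a looser pattern.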
