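I want to extract the formatted text in a GtkTextView as HTML or Pango markup.
I have a small text editor with formatting like this, so the formatting elements are simple ones: <b>, <i>, and so on.
Is there a way to get the formatted text out of a TextView?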
You can use gtk_text_buffer_serialize(). However, the only built-in serializer is GTK's internal text buffer format, so if you want HTML or Pango markup, you'll have to write the serializing function yourself.
Several years ago I wrote a GtkTextBuffer serializer for RTF. I don't know if it'll help you or inspire you to write your own.
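I needed to convert the content of a Gtk TextBuffer carrying Pango rich-text markup to HTML, the format in which the application stores its data, which is similar to what you are asking.
I couldn't find a simple out-of-the-box way, so I ended up writing my own converter, from the Gtk serialized content to HTML.
It uses html, part of the standard library, and since we already had BeautifulSoup4 as a dependency, it makes use of that as well.
First, we define a class derived from Gtk.TextBuffer whose get_text method returns the content either as plain text or as HTML, depending on whether include_hidden_chars is set: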
from typing import Optional

from gi.repository import Gtk


class PangoBuffer(Gtk.TextBuffer):

    def get_text(self,
                 start: Optional[Gtk.TextIter] = None,
                 end: Optional[Gtk.TextIter] = None,
                 include_hidden_chars: bool = False) -> str:
        """Get the buffer content.

        If `include_hidden_chars` is set, then the html markup content is
        returned. If False, then the text only is returned."""
        if start is None:
            start = self.get_start_iter()
        if end is None:
            end = self.get_end_iter()

        if include_hidden_chars is False:
            return super().get_text(start, end, include_hidden_chars=False)
        else:
            # Serialize with GTK's built-in tagset format (binary content),
            # then hand it to PangoToHtml, defined further down.
            format_ = self.register_serialize_tagset()
            content = self.serialize(self, format_, start, end)
            return PangoToHtml().feed(content)
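The important part is inside the else block. I would have preferred to develop my own serializer, but documentation is sparse. So we use the built-in serializer, which returns binary content.
This content is essentially XML markup with an extra header and footer: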
# Truncated for legibility.
GTKTEXTBUFFERCONTENTS-0001\x00\x00\x07Z
<text_view_markup>
 <tags>
  <tag id="12" priority="12"> </tag>  # Tags can be empty
  <tag name="italic" priority="2">
   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
  </tag>
  <tag id="7" priority="7">
   <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />
   <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />
   <attr name="weight" type="gint" value="700" />
  </tag>
 </tags>
 <text>
  <apply_tag name="italic">This is italic</apply_tag>
  <apply_tag id="1">. </apply_tag>
  <apply_tag id="2">This is italic</apply_tag>
  <apply_tag id="3">\n </apply_tag>
  <apply_tag id="7">This is bold, italic, and has background colouring.</apply_tag>
 </text>
</text_view_markup>
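From this we can tell that the tags are not sorted, and that each one has either an id or a name.
Tags carrying an id are called anonymous tags, and are typically created by Pango when deserializing content.
Named tags are usually the ones defined in your application: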
# text_buffer is assumed to be a Gtk.TextBuffer instance, e.g. a PangoBuffer.
tag_bold = text_buffer.create_tag("bold", weight=Pango.Weight.BOLD)
tag_italic = text_buffer.create_tag("italic", style=Pango.Style.ITALIC)
tag_underline = text_buffer.create_tag("underline", underline=Pango.Underline.SINGLE)
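The header contains a length mark made of raw bytes, which may not decode when calling bytes.decode, so it has to be stripped before decoding the rest to an XML string.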
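The header handling can be sketched in isolation as follows. The helper name split_serialized and the sample bytes are made up for this example, and reading the four bytes after the magic string as a big-endian length is an assumption drawn from the sample bytes shown earlier, not something confirmed here against the GTK sources.

```python
import struct
from typing import Tuple

MAGIC = b"GTKTEXTBUFFERCONTENTS-0001"
MARKER = b"<text_view_markup>"


def split_serialized(data: bytes) -> Tuple[int, str]:
    """Return (declared_length, xml_text) for a serialized buffer blob."""
    if not data.startswith(MAGIC):
        raise ValueError("not a GtkTextBuffer serialization")
    # Assumption: the next four bytes are a big-endian unsigned length.
    raw_length = data[len(MAGIC):len(MAGIC) + 4]
    (declared_length,) = struct.unpack(">I", raw_length)
    # Locate the XML by its root element rather than by a fixed offset,
    # and decode only that part.
    start = data.index(MARKER)
    return declared_length, data[start:].decode("utf-8")


sample = (MAGIC + b"\x00\x00\x07Z"
          + b"<text_view_markup><tags></tags><text>hi</text></text_view_markup>")
length, xml_text = split_serialized(sample)
print(length)         # 1882, i.e. 0x0000075A
print(xml_text[:18])  # <text_view_markup>
```

Searching for the root element mirrors what the converter below does; the length value is only decoded here to show what those extra bytes appear to be.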
The PangoToHtml class then does the actual work:
from html.parser import HTMLParser
from typing import Dict, List, Tuple

from bs4 import BeautifulSoup
from bs4.element import Tag
from gi.repository import Pango


class PangoToHtml(HTMLParser):
    """Decode a subset of Pango markup and serialize it as HTML.

    Only the Pango markup used within Gourmet is handled, although expanding it
    is not difficult.

    Due to the way that Pango attributes work, the HTML is not necessarily the
    simplest. For example italic tags may be closed early and reopened if other
    attributes, e.g. bold, are inserted mid-way:

        <i> italic text </i><i><u>and underlined</u></i>

    This means that the HTML resulting from the conversion by this object may
    differ from the original that was fed to the caller.
    """

    def __init__(self):
        super().__init__()
        self.markup_text: str = ""  # the resulting content
        self.current_opening_tags: str = ""  # used during parsing
        self.current_closing_tags: List = []  # used during parsing

        # The key is the Pango id of a tag, and the value is a tuple of opening
        # and closing html tags for this id.
        self.tags: Dict[str, Tuple[str, str]] = {}

        # Optionally, links can be specified, in a {link text: target} format.
        self.links: Dict[str, str] = {}

        # If links are specified, it is possible to ignore them, as is done with
        # time links.
        self.ignore_links: bool = False

        # Used as heuristics for parsing links, when applicable.
        self.is_colored_and_underlined: bool = False

    # Maps serialized attribute values (or colour attribute names) to HTML.
    tag2html: Dict[str, Tuple[str, str]] = {
        Pango.Style.ITALIC.value_name: ("<i>", "</i>"),  # Pango doesn't do <em>
        str(Pango.Weight.BOLD.real): ("<b>", "</b>"),
        Pango.Underline.SINGLE.value_name: ("<u>", "</u>"),
        "foreground-gdk": (r'<span foreground="{}">', "</span>"),
        "background-gdk": (r'<span background="{}">', "</span>")
    }

    @staticmethod
    def pango_to_html_hex(val: str) -> str:
        """Convert a Pango colour string to an HTML hex colour.

        Pango strings have the format 'ffff:ffff:ffff' (for white), with
        16 bits per channel. Each channel is scaled down to 8 bits and the
        three are joined into a single string: '#ffffff'.
        """
        red, green, blue = val.split(":")
        red = hex(255 * int(red, base=16) // 65535)[2:].zfill(2)
        green = hex(255 * int(green, base=16) // 65535)[2:].zfill(2)
        blue = hex(255 * int(blue, base=16) // 65535)[2:].zfill(2)
        return f"#{red}{green}{blue}"

    def feed(self, data: bytes) -> str:
        """Convert a serialized buffer (its text and tags) to an html string.

        Unlike an HTMLParser, the whole string must be passed at once, chunks
        are not supported.

        Optionally, a dictionary of links, in the format {text: target}, can be
        specified. Links will be inserted if some text in the markup will be
        coloured, underlined, and matching an entry in the dictionary.

        If `ignore_links` is set, along with the `links` dictionary, then links
        will be serialized as regular text, and the link targets will be lost.
        """
        # Remove the Pango header: it contains a length mark, which we don't
        # care about, but which does not necessarily decode as valid chars.
        header_end = data.find(b"<text_view_markup>")
        data = data[header_end:].decode()

        # Get the tags
        tags_begin = data.index("<tags>")
        tags_end = data.index("</tags>") + len("</tags>")
        tags = data[tags_begin:tags_end]
        data = data[tags_end:]

        # Get the textual content
        text_begin = data.index("<text>")
        text_end = data.index("</text>") + len("</text>")
        text = data[text_begin:text_end]

        # Convert the tags to html.
        # We know that only a subset of HTML is handled in Gourmet:
        # italics, bold, underlined, normal, and links (coloured & underlined)
        soup = BeautifulSoup(tags, features="lxml")
        tags = soup.find_all("tag")

        tags_list = {}
        for tag in tags:
            opening_tags = ""
            closing_tags = ""

            # The tag may have a name, for named tags, or else an id
            tag_name = tag.attrs.get('id')
            tag_name = tag.attrs.get('name', tag_name)

            attributes = [c for c in tag.contents if isinstance(c, Tag)]
            for attribute in attributes:
                vtype = attribute['type']
                value = attribute['value']
                name = attribute['name']

                if vtype == "GdkColor":  # Convert colours to html
                    if name in ['foreground-gdk', 'background-gdk']:
                        opening, closing = self.tag2html[name]
                        hex_color = self.pango_to_html_hex(value)
                        opening = opening.format(hex_color)
                    else:
                        continue  # no idea!
                else:
                    opening, closing = self.tag2html[value]

                opening_tags += opening
                closing_tags = closing + closing_tags  # closing tags are FILO

            # Empty tags are stored too, so that every <apply_tag> pushes a
            # closing entry and the stack stays balanced.
            tags_list[tag_name] = opening_tags, closing_tags

        self.tags = tags_list

        # Create a single output string that will be sequentially appended to
        # during feeding of text. It can then be returned once we've parsed all
        # of the text.
        self.markup_text = ""
        self.current_opening_tags = ""
        self.current_closing_tags = []  # Closing tags are FILO
        self.is_colored_and_underlined = False

        super().feed(text)

        return self.markup_text

    def handle_starttag(self, tag: str, attrs: List[Tuple[str, str]]) -> None:
        # The pango tags are either "apply_tag", or "text". We only really care
        # about the "apply_tag". There could be an assert, but we let the
        # parser quietly handle nonsense.
        if tag == "apply_tag":
            attrs = dict(attrs)
            tag_name = attrs.get('id')  # A tag may have a name, or else an id
            tag_name = attrs.get('name', tag_name)
            tags = self.tags.get(tag_name)

            if tags is not None:
                self.current_opening_tags, closing_tag = tags
                self.current_closing_tags.append(closing_tag)

    def handle_data(self, data: str) -> None:
        data = self.current_opening_tags + data
        self.markup_text += data

    def handle_endtag(self, tag: str) -> None:
        if self.current_closing_tags:  # Can be empty due to closing "text" tag
            self.markup_text += self.current_closing_tags.pop()
        self.current_opening_tags = ""
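According to its documentation, HTMLParser serves as the basis for parsing text files formatted in HTML (HyperText Markup Language) and XHTML. We know that we want to handle the start and end tags, and the content between them.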
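To make the callback order concrete, here is a small sketch that records what HTMLParser reports for a fragment shaped like the serialized text. The EventLogger class and the sample string are illustrative only and are not part of the converter.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class EventLogger(HTMLParser):
    """Record each parser callback as a tuple, in the order received."""

    def __init__(self) -> None:
        super().__init__()
        self.events: List[tuple] = []

    def handle_starttag(self, tag: str,
                        attrs: List[Tuple[str, Optional[str]]]) -> None:
        # Attributes arrive as a list of (name, value) pairs.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data: str) -> None:
        self.events.append(("data", data))

    def handle_endtag(self, tag: str) -> None:
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<text>a<apply_tag name="bold">b</apply_tag></text>')
logger.close()
for event in logger.events:
    print(event)
```

The start tag, its data, and its end tag arrive as three separate calls, which is why the converter has to keep the pending opening and closing tags as state between callbacks.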
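In the serialized content, tags are referenced by their name or id, so they have to be processed beforehand.
For that I chose BeautifulSoup, since it offers an easy way to walk the XML tags in a simple loop.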
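Could the whole thing be done with just BeautifulSoup, or just the html library? Probably yes, but I needed support for various kinds of links, so the end result would have been different, and I needed the flexibility that HTMLParser provides.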
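For comparison, a stdlib-only variant is sketched below using xml.etree.ElementTree in place of both BeautifulSoup and HTMLParser. It is not the converter described above: link handling and colours are left out, only italic, bold and underline are mapped, text is HTML-escaped (which the converter above does not do), and the names STYLE_TO_HTML, build_tag_table and to_html are invented for the example. It also assumes the serialized markup is well-formed XML once the header has been removed.

```python
import xml.etree.ElementTree as ET
from html import escape
from typing import Dict, Tuple

# Serialized attribute value -> (opening, closing) HTML tags.
STYLE_TO_HTML: Dict[str, Tuple[str, str]] = {
    "PANGO_STYLE_ITALIC": ("<i>", "</i>"),
    "700": ("<b>", "</b>"),  # the weight is serialized as an integer
    "PANGO_UNDERLINE_SINGLE": ("<u>", "</u>"),
}


def build_tag_table(tags: ET.Element) -> Dict[str, Tuple[str, str]]:
    """Map each tag's name (or id) to its opening and closing HTML."""
    table = {}
    for tag in tags.iter("tag"):
        key = tag.get("name", tag.get("id"))
        opening, closing = "", ""
        for attr in tag.iter("attr"):
            pair = STYLE_TO_HTML.get(attr.get("value"))
            if pair is None:
                continue  # unknown attribute: skipped in this sketch
            opening += pair[0]
            closing = pair[1] + closing  # close in reverse order
        table[key] = (opening, closing)
    return table


def to_html(node: ET.Element, table: Dict[str, Tuple[str, str]]) -> str:
    """Recursively render a <text> or <apply_tag> element."""
    key = node.get("name", node.get("id"))
    opening, closing = table.get(key, ("", ""))
    parts = [opening, escape(node.text or "", quote=False)]
    for child in node:
        parts.append(to_html(child, table))
        # Text following a child element is stored as that child's tail.
        parts.append(escape(child.tail or "", quote=False))
    parts.append(closing)
    return "".join(parts)


markup = (
    '<text_view_markup><tags>'
    '<tag name="italic" priority="1">'
    '<attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" /></tag>'
    '<tag name="bold" priority="0">'
    '<attr name="weight" type="gint" value="700" /></tag>'
    '</tags>'
    '<text>ddf<apply_tag name="bold">fd<apply_tag name="italic">df'
    '</apply_tag>fd</apply_tag>dff</text></text_view_markup>'
)
root = ET.fromstring(markup)
result = to_html(root.find("text"), build_tag_table(root.find("tags")))
print(result)  # ddf<b>fd<i>df</i>fd</b>dff
```

Because the element tree already encodes nesting, no explicit stack of closing tags is needed here. The trade-off is that a tree walk gives fewer hooks for heuristics such as the link detection mentioned above.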
Here is a basic unit test:
from pango_html import PangoToHtml


def test_convert_colors_to_html():
    val = "0:0:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#000000"

    val = "ffff:0:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ff0000"

    val = "0:ffff:0"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#00ff00"

    val = "0:0:ffff"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#0000ff"

    val = "ffff:ffff:ffff"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ffffff"

    val = "0:00000000:ffff"  # add some arbitrary amounts of leading zeroes
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#0000ff"

    val = "ff00:d700:0000"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#fed600"  # Gold

    val = "ffff:1414:9393"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#ff1493"  # Deep Pink

    val = "4747:5f5f:9494"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#475f94"  # Some Blue

    val = "00fd:ffdc:ff5c"
    ret = PangoToHtml.pango_to_html_hex(val)
    assert ret == "#00fefe"  # Some other blue


def test_pango_markup_to_html():
    # These are examples found throughout the application
    pango_markup = b'GTKTEXTBUFFERCONTENTS-0001\x00\x00\x07Z <text_view_markup>\n <tags>\n <tag id="12" priority="12">\n </tag>\n <tag id="2" priority="2">\n <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n </tag>\n <tag id="8" priority="8">\n </tag>\n <tag id="3" priority="3">\n </tag>\n <tag id="7" priority="7">\n <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />\n </tag>\n <tag id="4" priority="4">\n <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n <attr name="weight" type="gint" value="700" />\n </tag>\n <tag id="5" priority="5">\n <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n <attr name="weight" type="gint" value="700" />\n <attr name="underline" type="PangoUnderline" value="PANGO_UNDERLINE_SINGLE" />\n </tag>\n <tag id="0" priority="0">\n <attr name="weight" type="gint" value="700" />\n </tag>\n <tag id="1" priority="1">\n </tag>\n <tag id="6" priority="6">\n </tag>\n <tag id="9" priority="9">\n <attr name="foreground-gdk" type="GdkColor" value="0:0:ffff" />\n </tag>\n <tag id="11" priority="11">\n <attr name="background-gdk" type="GdkColor" value="0:0:ffff" />\n <attr name="foreground-gdk" type="GdkColor" value="ffff:ffff:ffff" />\n </tag>\n <tag id="10" priority="10">\n </tag>\n </tags>\n<text><apply_tag id="0">This is bold</apply_tag><apply_tag id="1">. </apply_tag><apply_tag id="2">This is italic</apply_tag><apply_tag id="3">\n </apply_tag><apply_tag id="4">This is bold, italic, and </apply_tag><apply_tag id="5">underlined!</apply_tag><apply_tag id="6">\n </apply_tag><apply_tag id="7">This is a test of bg color</apply_tag><apply_tag id="8">\n </apply_tag><apply_tag id="9">This is a test of fg color</apply_tag><apply_tag id="10">\n </apply_tag><apply_tag id="11">This is a test of fg and bg color</apply_tag><apply_tag id="12">\n +</apply_tag></text>\n</text_view_markup>\n'  # noqa
    expected = '<b>This is bold</b>. <i>This is italic</i>\n <i><b>This is bold, italic, and </b></i><i><b><u>underlined!</u></b></i>\n <span background="#0000ff">This is a test of bg color</span>\n <span foreground="#0000ff">This is a test of fg color</span>\n <span background="#0000ff"><span foreground="#ffffff">This is a test of fg and bg color</span></span>\n +'  # noqa
    ret = PangoToHtml().feed(pango_markup)
    assert ret == expected

    pango_markup = b'GTKTEXTBUFFERCONTENTS-0001\x00\x00\x01i <text_view_markup>\n <tags>\n <tag name="italic" priority="1">\n <attr name="style" type="PangoStyle" value="PANGO_STYLE_ITALIC" />\n </tag>\n <tag name="bold" priority="0">\n <attr name="weight" type="gint" value="700" />\n </tag>\n </tags>\n<text>ddf<apply_tag name="bold">fd<apply_tag name="italic">df</apply_tag>fd</apply_tag>dff</text>\n</text_view_markup>\n'  # noqa
    expected = 'ddf<b>fd<i>df</i>fd</b>dff'
    ret = PangoToHtml().feed(pango_markup)
    assert ret == expected
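The Gold case in the colour test comes out as #fed600 rather than the familiar #ffd700 because pango_to_html_hex scales each 16-bit channel by 255/65535 and rounds down. If that matters, one alternative is to keep only the high byte of each channel. The function below is a hypothetical variant, not part of the converter, and it assumes every channel is at most ffff.

```python
def high_byte_hex(val: str) -> str:
    """Convert 'rrrr:gggg:bbbb' (16 bits per channel) to '#rrggbb'."""
    # Shifting right by 8 drops the low byte of each 16-bit channel.
    channels = [int(part, base=16) >> 8 for part in val.split(":")]
    return "#" + "".join(f"{channel:02x}" for channel in channels)


print(high_byte_hex("ff00:d700:0000"))  # #ffd700
print(high_byte_hex("ffff:1414:9393"))  # #ff1493
print(high_byte_hex("00fd:ffdc:ff5c"))  # #00ffff
```

Both approaches agree whenever a channel repeats its byte, as in 4747 or ffff. They differ for values such as ff00, where the scaled result is 254 and the high byte is 255.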