3

我有一个文件,我从具有 RTF 格式标记的 Microsoft Lync 对话中提取值。示例文件如下:

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 >Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl;\red0\green0\blue0 ;} {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 Craig...\embo0 \embo please\embo0 \embo close\embo0 \embo >out\embo0 \embo of\embo0 \embo your\embo0 \embo old\embo0 \embo client\embo0 \embo >and\embo0 \embo re-open\embo0\f1\par {*\lyncflags rtf=1}}

使用 Lua 脚本,我试图删除 RTF 标签,然后只提取对话的文本。所以我的函数的结果应该是:

克雷格...请关闭您的旧客户并重新打开

我尝试使用带有正则表达式的 string.gsub 来匹配模式并用空格替换它们以只留下文本但它不起作用。这是我到目前为止的 string.gsub 代码:

result = string.gsub(s, "\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?", " ")

任何建议将不胜感激!

额外的:

user1@capital.com @ 2013-01-18 17:48:03Z (TO: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl;\red0\green0\blue0; } {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 works\embo0 \embo for\embo0 \embo me..\embo0 \embo how\ embo0 \embo about\embo0 \embo embedding\embo0 \embo 图片?\embo0\f1\par {*\lyncflags rtf=1}}

user1@capital.com @ 2013-01-18 17:48:57Z (TO: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl;\red0\green0\blue0; } {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 I\embo0 \embo see\embo0 \embo it\embo0\f1\par {* \lyncflags rtf=1}}

user1@capital.com @ 2013-01-18 17:49:27Z (TO: user2@capital.com)

{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl;\red0\green0\blue0; } {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 让我们\embo0 \embo try\embo0 \embo a\embo0 \embo 会议。\embo0 \f1\par {*\lyncflags rtf=1}}

4

2 回答 2

2

Lua 模式没有or运算符 ( |) 或可选分组 ( (?:...)?)。像这样的东西可能会起作用:

s:match("{(.+)}"):gsub("%b{}", ""):gsub("\\%w+", "")

将返回:

"    Craig...  please  close  >out  of  your  old  client  >and  re-open "

首先gsub删除所有对{}及其内容,第二个gsub删除所有 rtf 标签(尽管其中似乎有一些允许空格,因此您可能需要调整模式)。

于 2013-02-04T22:37:09.380 回答
0

尝试这个:

local s = '{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 >Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 Craig...\embo0 \embo please\embo0 \embo close\embo0 \embo >out\embo0 \embo of\embo0 \embo your\embo0 \embo old\embo0 \embo client\embo0 \embo >and\embo0 \embo re-open\embo0\f1\par {*\lyncflags rtf=1}}\n'
    ..'{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 works\embo0 \embo for\embo0 \embo me..\embo0 \embo how\embo0 \embo about\embo0 \embo embedding\embo0 \embo pictures?\embo0\f1\par {*\lyncflags rtf=1}}\n'
    ..'{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 I\embo0 \embo see\embo0 \embo it\embo0\f1\par {*\lyncflags rtf=1}}\n'
    ..'{\rtf1\fbidis\ansi\ansicpg1252\deff0\nouicompat\deflang1033{\fonttbl{\f0\fnil\fcharset0 Segoe UI;}{\f1\fnil Segoe UI;}} {\colortbl ;\red0\green0\blue0;} {*\generator Riched20 15.0.4420}{*\mmathPr\mwrapIndent1440 }\viewkind4\uc1 \pard\cf1\embo\f0\fs20 let\'s\embo0 \embo try\embo0 \embo a\embo0 \embo meeting.\embo0\f1\par {*\lyncflags rtf=1}}\n'
local text = string.gsub(s, '{(.-)}[}]?', ''):gsub('embo',''):gsub('0',''):gsub('iewkind4uc1 pardcf1',''):gsub('1par',''):gsub('s2',''):gsub('>','')
print(text)

输出

Craig... please close out of your old client and re-open
works for me.. how about embedding pictures?
I see it
let's try a meeting.

于 2019-05-18T20:49:54.150 回答