21

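I need to parse the transcripts of live chat conversations. My first thought on seeing the file was to throw regular expressions at the problem, but I would like to know what other approaches people have used.

I put "elegant" in the title because I have found before that this type of task tends to become hard to maintain when it relies on regular expressions alone.

The transcripts are generated by www.providesupport.com and emailed to an account; I then extract the plain-text transcript attachment from the email.

The reason for parsing the file is to extract the conversation text for later use, and also to identify the visitor's and the operator's names so that the information can be made available via a CRM.

Here is an example of a transcript file: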

Chat Transcript

Visitor: Random Website Visitor 
Operator: Milton
Company: Initech
Started: 16 Oct 2008 9:13:58
Finished: 16 Oct 2008 9:45:44

Random Website Visitor: Where do i get the cover sheet for the TPS report?
* There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button
* Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.
Milton: Y-- Excuse me. You-- I believe you have my stapler?
Random Website Visitor: I really just need the cover sheet, okay?
Milton: it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire...
Random Website Visitor: oh i found it, thanks anyway.
* Random Website Visitor is now off-line and may not reply. Currently in room: Milton.
Milton: Well, Ok. But… that's the last straw.
* Milton has left the conversation. Currently in room:  room is empty.

Visitor Details
---------------
Your Name: Random Website Visitor
Your Question: Where do i get the cover sheet for the TPS report?
IP Address: 255.255.255.255
Host Name: 255.255.255.255
Referrer: Unknown
Browser/OS: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)

9 Answers

12

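No. In fact, for the specific type of task you describe, I doubt there is a "cleaner" way to do it than regular expressions. It looks like your files have embedded line breaks, so what we would typically do here is make the line the unit of decomposition and apply a regex to each line. Meanwhile, you create a small state machine and use the regex matches to trigger transitions in that state machine. That way you know where you are in the file and what kind of character data you can expect. Also, consider using named capture groups and loading the regexes from an external file. Then, if the transcript format changes, it is just a matter of tweaking the regexes rather than writing new parsing-specific code.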
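The line-by-line state machine described above might look roughly like the following Python sketch. The state names, the `PATTERNS` table and the `parse_transcript` function are made up for illustration, and the patterns assume the layout of the sample transcript in the question. In practice the table could be loaded from an external file, as suggested.

```python
import re

# Hypothetical patterns with named capture groups; these could live in an
# external file so that a format change only means editing the patterns.
PATTERNS = {
    "field": re.compile(r"^(?P<name>[^:]+):\s*(?P<value>.*?)\s*$"),
    "server": re.compile(r"^\*\s+(?P<text>.*?)\s*$"),
    "footer": re.compile(r"^Visitor Details\s*$"),
}


def parse_transcript(text):
    """Walk the transcript line by line, switching state on regex matches."""
    result = {"header": {}, "messages": [], "details": {}}
    state = "start"
    for line in text.splitlines():
        if state == "start":
            # Ignore everything until the title line.
            if line.strip() == "Chat Transcript":
                state = "header"
            continue
        if not line.strip():
            # A blank line ends the header once at least one field was read.
            if state == "header" and result["header"]:
                state = "conversation"
            continue
        if state == "header":
            m = PATTERNS["field"].match(line)
            if m:
                result["header"][m.group("name")] = m.group("value")
        elif state == "conversation":
            if PATTERNS["footer"].match(line):
                state = "details"
                continue
            # Server notices are tested first because they may contain colons.
            m = PATTERNS["server"].match(line)
            if m:
                result["messages"].append(("Server", m.group("text")))
                continue
            m = PATTERNS["field"].match(line)
            if m:
                result["messages"].append((m.group("name"), m.group("value")))
        elif state == "details":
            # The dashed underline has no colon, so it is simply skipped.
            m = PATTERNS["field"].match(line)
            if m:
                result["details"][m.group("name")] = m.group("value")
    return result
```

Because each state only tries the patterns that make sense at that point, a `Visitor:` header line and a `Milton:` chat line can share the same `field` pattern without being confused.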

answered 2008-10-21T23:25:53.397
11

With Perl, you can use Parse::RecDescent.

It is simple, and your grammar will be maintainable later on.

answered 2008-10-22T03:01:20.187
8

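You might want to consider a full parser generator.

Regular expressions are good for searching text for small substrings, but they are badly underpowered if you are really interested in parsing the entire file into meaningful data.

They are especially inadequate if the context of the substring matters.

Most people throw regexes at everything because that is what they know. They have never learned any parser-generating tools, so they end up hand-coding much of the production-rule composition and semantic-action handling that a parser generator gives you for free.

Regexes are great, but if you need a parser, they are no substitute for one.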

answered 2008-10-22T00:23:33.343
6

Here are two parsers based on the lepl parser generator library. They both produce the same result.

from pprint import pprint
from lepl import AnyBut, Drop, Eos, Newline, Separator, SkipTo, Space

# field = name , ":" , value
name, value = AnyBut(':\n')[1:,...], AnyBut('\n')[::'n',...]    
with Separator(~Space()[:]):
    field = name & Drop(':') & value & ~(Newline() | Eos()) > tuple

header_start   = SkipTo('Chat Transcript' & Newline()[2])
header         = ~header_start & field[1:] > dict
server_message = Drop('* ') & AnyBut('\n')[:,...] & ~Newline() > 'Server'
conversation   = (server_message | field)[1:] > list
footer_start   = 'Visitor Details' & Newline() & '-'*15 & Newline()
footer         = ~footer_start & field[1:] > dict
chat_log       = header & ~Newline() & conversation & ~Newline() & footer

pprint(chat_log.parse_file(open('chat.log')))

A stricter parser:

from functools import reduce  # a builtin in Python 2, must be imported in Python 3
from pprint import pprint
from lepl import And, Drop, Newline, Or, Regexp, SkipTo

def Field(name, value=Regexp(r'\s*(.*?)\s*?\n')):
    """'name , ":" , value' matcher"""
    return name & Drop(':') & value > tuple

Fields = lambda names: reduce(And, map(Field, names))

header_start   = SkipTo(Regexp(r'^Chat Transcript$') & Newline()[2])
header_fields  = Fields("Visitor Operator Company Started Finished".split())
server_message = Regexp(r'^\* (.*?)\n') > 'Server'
footer_fields  = Fields(("Your Name, Your Question, IP Address, "
                         "Host Name, Referrer, Browser/OS").split(', '))

with open('chat.log') as f:
    # parse header to find Visitor and Operator's names
    headers, = (~header_start & header_fields > dict).parse_file(f)
    # only Visitor, Operator and Server may take part in the conversation
    message = reduce(Or, [Field(headers[name])
                          for name in "Visitor Operator".split()])
    conversation = (message | server_message)[1:]
    messages, footers = ((conversation > list)
                         & Drop('\nVisitor Details\n---------------\n')
                         & (footer_fields > dict)).parse_file(f)

pprint((headers, messages, footers))

Output:

({'Company': 'Initech',
  'Finished': '16 Oct 2008 9:45:44',
  'Operator': 'Milton',
  'Started': '16 Oct 2008 9:13:58',
  'Visitor': 'Random Website Visitor'},
 [('Random Website Visitor',
   'Where do i get the cover sheet for the TPS report?'),
  ('Server',
   'There are no operators available at the moment. If you would like to leave a message, please type it in the input field below and click "Send" button'),
  ('Server',
   'Call accepted by operator Milton. Currently in room: Milton, Random Website Visitor.'),
  ('Milton', 'Y-- Excuse me. You-- I believe you have my stapler?'),
  ('Random Website Visitor', 'I really just need the cover sheet, okay?'),
  ('Milton',
   "it's not okay because if they take my stapler then I'll, I'll, I'll set the building on fire..."),
  ('Random Website Visitor', 'oh i found it, thanks anyway.'),
  ('Server',
   'Random Website Visitor is now off-line and may not reply. Currently in room: Milton.'),
  ('Milton', "Well, Ok. But… that's the last straw."),
  ('Server',
   'Milton has left the conversation. Currently in room:  room is empty.')],
 {'Browser/OS': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; InfoPath.1; .NET CLR 2.0.50727)',
  'Host Name': '255.255.255.255',
  'IP Address': '255.255.255.255',
  'Referrer': 'Unknown',
  'Your Name': 'Random Website Visitor',
  'Your Question': 'Where do i get the cover sheet for the TPS report?'})
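
A result shaped like the output above could then be reduced to what the question asks for: the two names plus the conversation text. This is only a sketch; `summarize` is a hypothetical helper, and it assumes the `Visitor` and `Operator` header keys and the `'Server'` label shown in the output.

```python
def summarize(headers, messages):
    """Return the participant names and the chat text without server notices."""
    # Keep only lines typed by a person; server notices are labelled 'Server'.
    spoken = [(who, text) for who, text in messages if who != "Server"]
    return {
        "visitor": headers["Visitor"],
        "operator": headers["Operator"],
        "text": "\n".join("%s: %s" % pair for pair in spoken),
    }
```

Calling `summarize(headers, messages)` with the parsed values gives a plain dictionary that can be handed to whatever code feeds the CRM.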
answered 2009-11-01T16:17:57.083
5

Build a parser? I can't tell whether your data is regular enough for that, but it might be worth looking into.

answered 2008-10-22T00:03:23.323
4

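Using multi-line, commented regexes can mitigate the maintenance problem somewhat. Try to avoid the one-line super regex!

Also, consider breaking the regex down into individual tasks, one for each "thing" you want to get. E.g.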

visitor  = text[/Visitor:(.*)/, 1]    # String#[] with a regex returns capture 1
operator = text[/Operator:(.*)/, 1]
body     = text[/whatever..../, 1]

instead of

text.match(/Visitor:(.*)\nOperator:(.*)...whatever to giant regex/m) do
  visitor = $1
  operator = $2
  etc.
end

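That makes it easy to change how any particular item is parsed. As for parsing a file with many "chat blocks", just have a single simple regex that matches one chat block, iterate over the text, and pass the match data from it to your group of other matchers.

This will obviously cost some performance, but unless you are processing enormous files I wouldn't worry.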
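
For comparison, the same idea can be sketched in Python with `re.VERBOSE`, which allows whitespace and comments inside a pattern. The names `VISITOR`, `OPERATOR`, `BLOCK`, `first_name` and `parse_blocks` are hypothetical, and the sketch assumes that every chat block starts with a `Chat Transcript` line, as in the sample.

```python
import re

FLAGS = re.VERBOSE | re.MULTILINE

# One small, commented pattern per item to extract.
VISITOR = re.compile(r"""
    ^Visitor:        # field label at the start of a line
    [ \t]*           # optional padding after the colon
    (?P<name>.*?)    # the visitor name, as short as possible
    [ \t]*$          # trailing blanks up to the end of the line
""", FLAGS)

OPERATOR = re.compile(r"""
    ^Operator:       # field label at the start of a line
    [ \t]*           # optional padding after the colon
    (?P<name>.*?)    # the operator name, as short as possible
    [ \t]*$          # trailing blanks up to the end of the line
""", FLAGS)

# A single pattern that matches one whole chat block.
BLOCK = re.compile(r"""
    ^Chat[ ]Transcript[ \t]*$                 # title line that opens a block
    .*?                                       # the block body, non-greedy
    (?= ^Chat[ ]Transcript[ \t]*$ | \Z )      # stop before the next block or at the end
""", FLAGS | re.DOTALL)


def first_name(pattern, text):
    """Return the 'name' group of the first match, or None if there is none."""
    m = pattern.search(text)
    return m.group("name") if m else None


def parse_blocks(text):
    """Split the text into chat blocks and run the small matchers on each one."""
    return [
        {
            "visitor": first_name(VISITOR, block.group(0)),
            "operator": first_name(OPERATOR, block.group(0)),
        }
        for block in BLOCK.finditer(text)
    ]
```

Each pattern can be changed or tested on its own, which is the maintenance benefit being argued for here.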

answered 2008-10-21T23:18:14.210
2

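I have used Paul McGuire's pyParsing class library and I continue to be impressed by it: it is well documented, easy to get started with, and the rules are easy to tweak and maintain. By the way, the rules are expressed in your Python code. The log file certainly appears regular enough to parse each line as a stand-alone unit.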

answered 2008-10-22T17:05:41.823
2

Consider using Ragel: https://www.colm.net/open-source/ragel/

That is what powers Mongrel under the hood. Parsing a string multiple times will slow things down dramatically.
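
This is not Ragel, but the single-pass idea can be approximated in plain Python by combining the alternatives into one compiled pattern and matching each line exactly once. The names `LINE` and `classify` are hypothetical, and the two branches assume the line shapes seen in the sample transcript.

```python
import re

# One alternation, tried once per line. The outer group that matched is
# reported by match.lastgroup, so no second scan of the line is needed.
LINE = re.compile(r"""
    (?P<server> ^\*[ ]+ (?P<notice>.*) $ )                     # "* server notice"
  | (?P<field>  ^(?P<key>[^:\n]+) : [ \t]* (?P<value>.*) $ )   # "Name: text"
""", re.VERBOSE)


def classify(line):
    """Return a (speaker, text) pair for one line, using a single match."""
    m = LINE.match(line)
    if m is None:
        return ("other", line)
    if m.lastgroup == "server":
        return ("Server", m.group("notice").strip())
    return (m.group("key").strip(), m.group("value").strip())
```

The server branch comes first in the alternation because server notices can themselves contain a colon.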

answered 2008-10-22T00:36:12.943
0

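Just a quick post: I have only glanced at your transcript example, but I recently had to look into text parsing as well and was hoping to avoid going down the route of hand-rolled parsing. I did come across Ragel, which I have only just started to get my head around, but it looks pretty useful.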

answered 2008-10-22T00:11:12.963