
Here is an online programming contest we are planning to have.

What are some possible approaches to solving it?

From a random IRC (Internet Relay Chat) log, a small percentage of the user nicknames will be randomly deleted. The participant’s code must be able to fill in the missing user nicks. In other words, this event requires you to come up with an intelligent program that can figure out “who could have said what”.

It may be assumed that all communication will be in modern English, with or without punctuation.

For example -

Original Chat:

...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<user3>: I don’t know. How do I check?
<user3>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<user5>: I’m root, obey me!
<user2>: Huh?!
<user3>: user2, you better start using Linux!
...

Only the following will be given to the participant.

Chat log with some nicks deleted:

...
<user1>: Hey!
<user2>: Hello! Where are you from, user1?
<user3>: Can anybody help me out with Gnome installation?
<user1>: India. user3, do you have the X Windows System installed?
<user2>: Cool. What is Gnome, user3?
<%%%>: I don’t know. How do I check?
<%%%>: Its a desktop environment, user2.
<user2>: Oh yeah! Just googled.
<user1>: Type “startx” on the command line. Login as root and type “apt-get install gnome”.
<user3>: Thanks!
<%%%>: I’m root, obey me!
<%%%>: Huh?!
<user3>: user2, you better start using Linux!
...

The participant’s code will have the task of replacing each "<%%%>" with the appropriate user nick. In ambiguous cases, like the random comment by <user5> in the above example (which could have been said by any other user too!), the code should say so.
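To make the input and output concrete, here is a minimal Python sketch of reading such a log and writing it back with guesses filled in. It assumes one message per line in the form "<nick>: message" with "<%%%>" marking a deleted nick, as in the example above. The names `parse_log` and `fill_in`, and the "???" marker for an ambiguous line, are made up for illustration; the contest may well specify something different.

```python
import re

# Assumed line format: "<nick>: message", with "<%%%>" marking a deleted nick.
LINE_RE = re.compile(r"^<([^>]+)>:\s*(.*)$")
MISSING = "%%%"
AMBIGUOUS = "???"  # hypothetical marker for "could be anyone"


def parse_log(text):
    """Return a list of (nick_or_None, message) pairs, one per chat line."""
    entries = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip blank lines and anything not shaped like a chat line
        nick, message = match.groups()
        entries.append((None if nick == MISSING else nick, message))
    return entries


def fill_in(entries, guess):
    """Rebuild the log. `guess(index, entries)` returns a nick, or None if unsure."""
    lines = []
    for index, (nick, message) in enumerate(entries):
        if nick is None:
            nick = guess(index, entries) or AMBIGUOUS
        lines.append(f"<{nick}>: {message}")
    return "\n".join(lines)
```

Any guessing strategy can then be plugged in as the `guess` callback, and returning None from it is how a strategy would flag a line it cannot attribute.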


2 Answers


Two things come to mind: authorship attribution and chat disentanglement. Neither is exactly what you describe, but both are very close.

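Authorship attribution is the problem of working out which of a set of known authors wrote a particular document. Classic authorship attribution is usually done on large pieces of text (e.g. plays, novels, speeches), but people have been trying to do the same on shorter text samples from internet sources. A good reference would probably be anything by Moshe Koppel with "authorship" in the title, for example the recent paper Authorship Attribution in the Wild. Common approaches to the task use typical document classification methods, i.e. bag-of-words features and a machine learning classifier, over a set of words that are usually considered stopwords (e.g. as, of, the, etc.). The problem here is that all of this work is on documents and doesn't take into account the conversational nature of IRC data.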
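As a rough illustration of the stopword idea, the Python sketch below builds a function-word frequency profile per nick and ranks nicks by cosine similarity to the unknown message. This is a deliberately simple stand-in, not the method from any particular paper: the tiny `FUNCTION_WORDS` list, the tokenizer, and the function names are all assumptions, and a real system would use a much longer word list and a trained classifier.

```python
import math
import re
from collections import Counter

# A tiny illustrative function-word list; a real one would be much longer.
FUNCTION_WORDS = ["a", "an", "and", "as", "do", "i", "is", "it", "of", "the", "to", "you"]


def profile(messages):
    """Relative frequency of each function word across a list of messages."""
    counts = Counter()
    total = 0
    for message in messages:
        for token in re.findall(r"[a-z']+", message.lower()):
            total += 1
            if token in FUNCTION_WORDS:
                counts[token] += 1
    return [counts[word] / total if total else 0.0 for word in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(known, message):
    """Sort nicks from most to least similar. `known` maps nick -> list of messages."""
    target = profile([message])
    scores = {nick: cosine(profile(msgs), target) for nick, msgs in known.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With single IRC lines the profiles are very sparse, so the scores will often be tied or zero, which is one concrete way the "documents, not conversations" limitation shows up.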

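Chat disentanglement is the problem of identifying the separate coherent "conversations" within chat data. It is a fairly hard problem, because you often need the context of the conversation to know who is replying to whom. I imagine this kind of approach would matter for this task too. For example, if the anonymous message is part of a conversation, that restricts the set of possible authors to the people in that conversation. I really only know of this from the paper Disentangling Chat by Elsner and Charniak. Their "related work" section gives a good overview of the field.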
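Real disentanglement is much more involved than this, but the "restrict the set of authors" step can be sketched with a crude proximity heuristic in Python. The sketch assumes the log is already parsed into a list of (nick, message) pairs with None for a deleted nick. The function name, the `window` parameter, and the rule that a nick addressed by name in a message did not write it are all assumptions made for illustration, not part of Elsner and Charniak's model.

```python
import re


def candidate_authors(entries, index, window=5):
    """Nicks that plausibly wrote entries[index], using a simple proximity heuristic.

    `entries` is a list of (nick_or_None, message) pairs.
    """
    lo = max(0, index - window)
    hi = min(len(entries), index + window + 1)
    # Anyone who spoke within `window` lines is treated as being in the conversation.
    nearby = {nick for nick, _ in entries[lo:hi] if nick is not None}
    # Someone addressed by name in the message is unlikely to be its author.
    words = set(re.findall(r"\w+", entries[index][1].lower()))
    addressed = {nick for nick in nearby if nick.lower() in words}
    return nearby - addressed
```

An empty result or a large one is itself useful: it is a natural signal for flagging the line as ambiguous rather than forcing a guess.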

Answered 2011-07-07T10:06:48.220

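One possible solution is to borrow the idea behind a naive Bayes "spam filter" classifier and look at which words the different nicks tend to use. Classify each message according to which user's word usage is "most like" the message sent by the unknown user. The downside is that if the message uses new words you have never seen before (which is quite likely), you will need higher-level contextual information.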
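A minimal sketch of that idea in Python, using a multinomial naive Bayes model over words with add-one (Laplace) smoothing so that unseen words lower a score instead of zeroing it out. The class name, method names, and tokenizer are assumptions for illustration; smoothing only softens the unseen-word problem and does not supply the higher-level context mentioned above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(message):
    """Lowercase the message and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", message.lower())


class NickClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # nick -> word -> count
        self.message_counts = Counter()          # nick -> number of messages seen
        self.vocabulary = set()

    def train(self, nick, message):
        tokens = tokenize(message)
        self.word_counts[nick].update(tokens)
        self.message_counts[nick] += 1
        self.vocabulary.update(tokens)

    def log_scores(self, message):
        """Log of prior times smoothed word likelihoods, for every known nick."""
        total_messages = sum(self.message_counts.values())
        vocab_size = len(self.vocabulary) + 1  # one extra slot for unseen words
        tokens = tokenize(message)
        scores = {}
        for nick, n_messages in self.message_counts.items():
            counts = self.word_counts[nick]
            total_words = sum(counts.values())
            score = math.log(n_messages / total_messages)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            scores[nick] = score
        return scores

    def predict(self, message):
        scores = self.log_scores(message)
        return max(scores, key=scores.get)
```

Training on the lines whose nicks survived and calling `predict` on each deleted line gives a baseline. Comparing the top two values from `log_scores` is one simple way to decide when to report a line as ambiguous.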

Answered 2011-07-06T20:47:35.850