I want to extract text from IRC logs. I have regular IRC logs from irssi that look like this:

00:12 -!- Barbora [post@gw1-nat-041.roburnet.sk] has joined #post.sk
00:12 -!- mirinda [~post@195.91.55.136] has quit [Broken pipe]
00:12 -!- rogue1 [post@86-41-114-24-dynamic.b-ras2.lmk.limerick.eircom.net] has joined #post.sk
00:12 -!- Komunista is now known as Anonym9901
00:13 -!- ajka [~post@78.141.102.209] has quit [Client exited]
00:16 < blackmamba> no fuj
00:16 < blackmamba> Komunista: lol
00:16 < blackmamba> "este trochu"
00:16 < blackmamba> "je taky velky"
00:17 -!- majopo [post@adsl-d192.84-47-63.t-com.sk] has quit [Client exited]
00:19 -!- Anonym9901 is now known as Komunista
00:19 -!- dido84 [post@BSN-143-83-49.dial-up.dsl.siol.net] has quit [Client exited]
00:19 < Komunista> no?
00:20 < Komunista> ja by som*nadavka*l
00:20 < Komunista> ako pes
00:20 -!- Komunista is now known as Anonym53560 

What I need is output like this:

no fuj lol este trochu je taky velky no ja by som*nadavka*l ako pes

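So just words separated by spaces and nothing else: no nicks, no quotes, no question marks, and so on. I need it as input for LDA.

I plan to remove the nicks in a post-processing step. I think that would be easier, or would it?

I would prefer PHP with regular expressions. I am not good at regex, which is why I am asking for help.

Thanks for your time!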

EDIT:

I am now using this code (thanks to m.buettner):

$input = ... ;
$smiles = [">:]", ":-)", ":)", ":o)", ":]", ":3", ":c)", ":>", "=]", "8)", "=)", ":}", ":^)", ">:D", ":-D", ":D", "8-D", "x-D", "X-D", "=-D", "=D", "=-3", "8-)", ">:[", ":-(", ":(", ":-c", ":c", ":-<", ":-[", ":[", ":{", ">.>", "<.<", ">.<", ">;]", ";-)", ";)", "*-)", "*)", ";-]", ";]", ";D", ";^)", ">:P", ":-P", ":P", "X-P", "x-p", ":-p", ":p", "=p", ":-Þ", ":Þ", ":-b", ":b", "=p", "=P", ">:o", ">:O", ":-O", ":O", "°o°", "°O°", ":O", "o_O", "o.O", "8-0", ">:\\", ">:/", ":-/", ":-.", ":\\", "=/", "=\\", ":S", ":'(", ";'("];

$input = str_replace($smiles, '', $input);
$resultStr = '';
preg_match_all('/^\d\d:\d\d\s+<[%|\s|@|+][_a-zA-Z0-9]*>\s([^\r\n]*)/m', $input, $matches);
$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

preg_match_all('/<[%|\s|@|+][_a-zA-Z0-9]*>/m', $input, $nicks);
$nicks[0] = str_replace(['<', '>', ' ', '%', '+', '$', '@'], '', $nicks[0]);
$resultStr = str_replace($nicks[0], '', $resultStr);

Any suggestions for improving it would be appreciated ;)

1 Answer

Something like this?

preg_match_all('/^\d\d:\d\d\s+<[^>]*>([^\r\n]*)/m', $input, $matches);

$resultStr = implode(' ', $matches[1]);
$resultStr = preg_replace('/[^\w\s*]+/', '', $resultStr);

First we match everything after hh:mm < name> up to the end of the line. Then we join those matches with spaces, and finally we remove every character that is not a word character, whitespace, or an asterisk. Add any other characters you want to keep to the character class in the preg_replace.
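As a rough end-to-end sketch of these steps, the snippet below runs them on the chat lines from the question and also drops nick mentions such as "Komunista:" inside messages. The extra capture group for the nick, the optional mode prefix `[\s@%+]`, the lookaround-based nick removal, and the variable names are all assumptions made for illustration; they are not part of the code above.

```php
<?php
// Sample lines copied from the log in the question.
$input = <<<'LOG'
00:12 -!- Komunista is now known as Anonym9901
00:16 < blackmamba> no fuj
00:16 < blackmamba> Komunista: lol
00:16 < blackmamba> "este trochu"
00:16 < blackmamba> "je taky velky"
00:19 < Komunista> no?
00:20 < Komunista> ja by som*nadavka*l
00:20 < Komunista> ako pes
LOG;

// Step 1: match chat lines only; group 1 is the nick (without an assumed
// one-character mode prefix), group 2 is the message text.
preg_match_all('/^\d\d:\d\d\s+<[\s@%+]?([^>\s]+)>\s?([^\r\n]*)/m', $input, $matches);

// Step 2: join all messages with single spaces.
$text = implode(' ', $matches[2]);

// Step 3: remove nicks that appear as whole tokens. This is done before
// punctuation is stripped so nicks containing special characters still match.
$nicks = array_unique($matches[1]);
if (count($nicks) > 0) {
    $quoted = array_map(function ($nick) {
        return preg_quote($nick, '/');
    }, $nicks);
    $text = preg_replace('/(?<!\w)(?:' . implode('|', $quoted) . ')(?!\w)/', '', $text);
}

// Step 4: keep only word characters, whitespace and asterisks.
$text = preg_replace('/[^\w\s*]+/', '', $text);

// Step 5: collapse the whitespace runs left behind by the removals.
$text = trim(preg_replace('/\s+/', ' ', $text));

echo $text, PHP_EOL;
// prints: no fuj lol este trochu je taky velky no ja by som*nadavka*l ako pes
```

Matching nicks as whole tokens avoids cutting a nick out of the middle of a longer word, which a plain str_replace would do. If the messages contain accented letters encoded as UTF-8, `\w` without the `u` modifier will not treat them as word characters; a Unicode-aware class such as `[^\p{L}\p{N}_\s*]` together with the `u` modifier is one possible adjustment.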

answered 2012-11-06T13:25:43.223