erlang - How to count the unique users in a log file?

Question

Given a txt log file with the following format:

USER_A timestamp1 otherstuff
USER_B timestamp2 otherstuff
USER_C timestamp3 otherstuff
USER_A timestamp4 otherstuff
USER_A timestamp5 otherstuff
USER_C timestamp6 otherstuff
USER_B timestamp7 otherstuff

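How would you count the number of distinct users in Erlang? I'm thinking of reading the file line by line and using the proplists module. Each user would be a key, and its value would be the number of occurrences. After reading the file, I would call: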

length(proplists:get_keys(List)).

Is this the right way to get my result?

score 4 · Accepted Answer

I would also use the sets module for this, since it is fast and a set contains no duplicates.

The following code should do the job:

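% Read the whole file into memory as a binary, then convert it to a string.
{ok,Bin} = file:read_file("test"),
List = binary_to_list(Bin),
% Split into lines, then keep the first space-separated token of each line.
Usernames = [hd(string:tokens(X," ")) || X <- string:tokens(List,[$\n])],
% Building a set drops the duplicates; its size is the number of unique users.
sets:size(sets:from_list(Usernames)).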

Edit: I removed the one-liner, since it did not add any value.

score 3 · Accepted Answer

Use a set from the sets module to store the usernames, then use sets:size/1.
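A minimal sketch of that idea, assuming the lines have already been read into a list of strings. The module name unique_users_sketch and the helper add_user/2 are made up for illustration and are not part of the original answer:

```erlang
-module(unique_users_sketch).
-export([count/1]).

%% Lines is a list of strings such as "USER_A timestamp1 otherstuff".
%% Returns the number of distinct first tokens (usernames).
count(Lines) ->
    Set = lists:foldl(fun add_user/2, sets:new(), Lines),
    sets:size(Set).

%% Add the first space-separated token of Line to Set.
%% Lines without any token (empty or blank) are skipped.
add_user(Line, Set) ->
    case string:tokens(Line, " \n") of
        [User | _] -> sets:add_element(User, Set);
        []         -> Set
    end.
```

Called as unique_users_sketch:count(["USER_A t1 x", "USER_B t2 x", "USER_A t3 x"]). this returns 2, because adding USER_A a second time leaves the set unchanged.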

Answered 2012-06-16T09:05:48.527

score 1 · Accepted Answer

Log files are often large, so consider processing one line at a time in a recursive function:

% Count the number of distinct users in the file named Filename                        
count_users(Filename) ->
    {ok, File} = file:open(Filename, [read, raw, read_ahead]),
    Usernames = usernames(File, sets:new()),
    file:close(File),
    sets:size(Usernames).

% Add all users in File, from the current file pointer position and forward,
% to Set.
% Side-effects: File is read and the file pointer is moved to the end.          
usernames(File, Set) ->
    case file:read_line(File) of
        {ok, Line} ->
            Username = hd(string:tokens(Line, " ")),
            usernames(File, sets:add_element(Username, Set));
        eof ->
            Set
    end.

You simply call it like this: count_users("logfile").

Note that usernames/2 must be tail recursive to work efficiently. Otherwise it will just consume more memory as the file grows.
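To illustrate the difference, here is a rough sketch that works on an in-memory list of lines instead of a file handle (an assumption made to keep it short). The module name tail_sketch and all function names in it are hypothetical:

```erlang
-module(tail_sketch).
-export([count_tail/1, count_body/1]).

%% Tail-recursive: the recursive call is the last thing collect/2 does,
%% so all state lives in the accumulator Set and the stack does not grow.
count_tail(Lines) ->
    sets:size(collect(Lines, sets:new())).

collect([Line | Rest], Set) ->
    collect(Rest, sets:add_element(first_token(Line), Set));
collect([], Set) ->
    Set.

%% Body-recursive: sets:add_element/2 runs only after the recursive call
%% returns, so one stack frame per line is kept until the end is reached.
count_body(Lines) ->
    sets:size(collect_body(Lines)).

collect_body([Line | Rest]) ->
    sets:add_element(first_token(Line), collect_body(Rest));
collect_body([]) ->
    sets:new().

%% First space-separated token of a non-blank line.
first_token(Line) ->
    hd(string:tokens(Line, " \n")).
```

Both functions return the same count. The usernames/2 function above has the shape of collect/2, which is why its memory use depends on the number of distinct users rather than on the number of lines.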
