regex - Regular expression help needed to convert an lst file to csv

Question

I have a file (ratings.lst) downloaded from IMDB Interfaces. The content appears to be in the following format:

Distribution   Votes      Rating  Title
0000001222     297339     8.4     Reservoir Dogs (1992)
0000001223     64504      8.4     The Third Man (1949)
0000000115     48173      8.4     Jodaeiye Nader az Simin (2011)
0000001232     324564     8.4     The Prestige (2006)
0000001222     301527     8.4     The Green Mile (1999)

My goal is to convert this file to a CSV (comma-separated) file with the following desired result (one row shown as an example):

Distribution   Votes      Rating  Title
0000001222,    301527,    8.4,    The Green Mile (1999)

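I am using TextPad, which supports regex-based search and replace. I'm not sure what kind of regular expression is needed to get the result above. Can someone help me with this? Thanks in advance.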

score 0 · Accepted Answer

The other regular expressions are somewhat overcomplicated. Because whitespace is guaranteed not to appear in the first three columns, you don't have to do a fancy match: "three columns of anything separated by whitespace" will do.

Try replacing ^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$ with \1,\2,\3,"\4", which gives the following output (using Notepad++):

Distribution,Votes,Rating,"Title"
0000001222,297339,8.4,"Reservoir Dogs (1992)"
0000001223,64504,8.4,"The Third Man (1949)"
0000000115,48173,8.4,"Jodaeiye Nader az Simin (2011)"
0000001232,324564,8.4,"The Prestige (2006)"
0000001222,301527,8.4,"The Green Mile (1999)"

Note the use of a non-greedy quantifier, .+?, to prevent accidentally matching more than we should. Also note that I've enclosed the fourth column with quote marks "" in case a comma appears in the movie title - otherwise the software you use to read the file would interpret Avatar, the Last Airbender as two columns.

The nice tabular alignment is gone - but if you open the file in Excel it will look fine.
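
For anyone who would rather script the conversion than run it in an editor, here is a minimal sketch of the same substitution in Python. It assumes Python's re module, where the re.MULTILINE flag makes ^ and $ anchor at each line; the sample text is shortened from the question, and the variable names are illustrative only.

```python
import re

# Pattern and replacement taken from the answer above.
# re.MULTILINE makes ^ and $ match at the start and end of every line.
PATTERN = re.compile(r"^(.+?)\s+(.+?)\s+(.+?)\s+(.+?)$", re.MULTILINE)
REPLACEMENT = r'\1,\2,\3,"\4"'

# Shortened sample from the question.
sample = (
    "Distribution   Votes      Rating  Title\n"
    "0000001222     297339     8.4     Reservoir Dogs (1992)\n"
    "0000001222     301527     8.4     The Green Mile (1999)\n"
)

converted = PATTERN.sub(REPLACEMENT, sample)
print(converted)
```

The lazy .+? groups stop at the first run of whitespace, while the last group is forced by $ to run to the end of the line, so spaces inside the title stay in the fourth column.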

Alternately, just do the entire thing in Excel.

score 0 · Accepted Answer

First replace every " with "", then do this:

Find: ^\([0-9]+\)[ \t]+\([0-9]+\)[ \t]+\([^ \t]+\)[ \t]+\(.*\)
Replace with: \1,\2,\3,"\4"
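
The reason for doubling the quotes first is that CSV escapes a quote inside a quoted field by writing it twice. A small sketch in Python, assuming the standard csv module; the helper quote_field and the sample title are made up for illustration and do not come from ratings.lst.

```python
import csv
import io


def quote_field(text: str) -> str:
    """Double any embedded quotes, then wrap the whole field in quotes."""
    return '"' + text.replace('"', '""') + '"'


# Hypothetical title containing both quotes and a comma.
title = 'Say "Cheese", Please (2001)'
manual = quote_field(title)

# csv.writer applies the same convention by default (doublequote=True).
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerow(["0000001222", "1234", "7.0", title])
row = buffer.getvalue()

print(manual)
print(row, end="")
```

Both routes produce the same quoted field, which suggests that a scripted conversion can leave the quoting to a CSV library instead of doing it with a second search and replace.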

score 0 · Accepted Answer

Press F8 to open the Replace dialog
Make sure Regular expression is selected
In Find what, enter: ^([[:digit:]]{10})[[:space:]]+([[:digit:]]+)[[:space:]]+([[:digit:]]{1,2}\.[[:digit:]])[[:space:]]+(.*)$
In Replace with, enter: \1,\2,\3,"\4"
Click Replace All


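Note: this allows one or more whitespace characters between the fields in ratings.lst. If you know the exact number of spaces, it is better to specify it.

Also note: I did not put spaces between the comma-separated items. You generally wouldn't, but feel free to add them.

Final note: I put the movie title in quotes so that a comma in the title does not break the CSV format. You may want to handle this differently.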
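
The stricter pattern from this answer can be tried outside TextPad as well. The sketch below is an approximation in Python: Python's re module has no POSIX classes, so \d and \s stand in for [[:digit:]] and [[:space:]], which is my substitution and not part of the answer.

```python
import re

# Approximate translation of the TextPad pattern:
# ten digits, a vote count, a rating such as 8.4, then the title.
STRICT = re.compile(r"^(\d{10})\s+(\d+)\s+(\d{1,2}\.\d)\s+(.*)$", re.MULTILINE)

# Shortened sample from the question.
sample = (
    "Distribution   Votes      Rating  Title\n"
    "0000001223     64504      8.4     The Third Man (1949)\n"
)

converted = STRICT.sub(r'\1,\2,\3,"\4"', sample)
print(converted)
```

Because the pattern insists on digits in the first three columns, the header line does not match and is left as it was, so it would need to be converted by hand or with a separate replacement.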

score 0 · Accepted Answer

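My bad, this is a C# program. I'll leave it here as an alternative solution.

The IgnorePatternWhitespace option is used so that the pattern can be commented.

This creates data that can be put into a CSV file. Note that, as in your example, there is no optional whitespace in the CSV output.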

using System;
using System.Linq;
using System.Text.RegularExpressions;

string data = @"Distribution   Votes      Rating  Title
0000001222     297339     8.4     Reservoir Dogs (1992)
0000001223     64504      8.4     The Third Man (1949)
0000000115     48173      8.4     Jodaeiye Nader az Simin (2011)
0000001232     324564     8.4     The Prestige (2006)
0000001222     301527     8.4     The Green Mile (1999)
";

string pattern = @"
^                     # Always start at the Beginning of line
(                     # Grouping
   (?<Value>[^\s]+)     # Place all text into Value named capture
   (?:\s+)              # Match but don't capture 1 to many spaces
){3}                  # 3 groups of data
(?<Value>[^\n\r]+)    # Append final to value named capture group of the match
";

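// Each match is one input line; the "Value" group holds its four captures in order.
var result = Regex.Matches(data, pattern, RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace)
                  .OfType<Match>()
                  .Select(mt => string.Join(",", mt.Groups["Value"].Captures
                                                   .OfType<Capture>()
                                                   .Select(c => c.Value)));

// Print one CSV line per match; writing the sequence itself would only show its type name.
Console.WriteLine(string.Join(Environment.NewLine, result));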

/* output
Distribution,Votes,Rating,Title
0000001222,297339,8.4,Reservoir Dogs (1992)
0000001223,64504,8.4,The Third Man (1949)
0000000115,48173,8.4,Jodaeiye Nader az Simin (2011)
0000001232,324564,8.4,The Prestige (2006)
0000001222,301527,8.4,The Green Mile (1999)
*/
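
For comparison, the same "three whitespace-separated columns, then the rest of the line" idea can be sketched without a regex at all. This is a different technique from the C# answer: it uses Python's str.split with maxsplit=3 and the standard csv module. The function name lst_to_csv and the last sample row are invented for illustration.

```python
import csv
import io


def lst_to_csv(text: str) -> str:
    """Split each line into three columns plus the remaining title and write CSV."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        # maxsplit=3 keeps any spaces inside the title in the fourth field.
        fields = line.strip().split(maxsplit=3)
        if len(fields) == 4:  # skip blank or incomplete lines
            writer.writerow(fields)
    return out.getvalue()


sample = (
    "Distribution   Votes      Rating  Title\n"
    "0000001222     301527     8.4     The Green Mile (1999)\n"
    # Made-up row to show how a comma in the title is handled.
    "0000000125     1000       8.0     Monsters, Inc. (2001)\n"
)

result = lst_to_csv(sample)
print(result, end="")
```

Unlike the C# output above, csv.writer quotes a title only when it contains a comma or a quote, so the result stays readable and still parses correctly.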
