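I have some SRT data that comes back with \r and \n markers as line breaks in the middle of each sentence. How can I find only those \r and \n markers that fall in the middle of the text/sentence, and not the ones that represent other line breaks?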

Sample source:

18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.

Desired output:

18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.

I'm absolutely rubbish at regular expressions, but my best guess (to no avail) was something like this:

var reg = /[\d\r][a-zA-z0-9\s+]+[\r]/

and then calling split() on it to remove any \r in the middle of one of the values. I'm sure this isn't even close to the right way, so... stackoverflow! :)

2 Answers

This regex should do the trick, assuming every line break in the data is a single \r:

/(\d+\r\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r)([^\r]+)\r([^\r]+)(\r|$)/g

To handle more lines of text (it has to be a fixed number), add more ([^\r]+)\r groups. Remember to add the matching $n references to the replacement string as well; with three lines it becomes '$1$2 $3 $4\r'.

Usage

mystring = mystring.replace(/(\d+\r\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r)([^\r]+)\r([^\r]+)(\r|$)/g, '$1$2 $3\r');
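
As a rough sketch of the three-line variant described above: the wrapper name joinThreeLineCues is made up for illustration, and the input is assumed to use a bare \r for every line break, as the original pattern does.

```javascript
// Hypothetical helper: the same pattern with one extra ([^\r]+)\r group,
// so it joins three lines of text per cue instead of two.
function joinThreeLineCues(srt) {
  var threeLines = /(\d+\r\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r)([^\r]+)\r([^\r]+)\r([^\r]+)(\r|$)/g;
  // $1 = index and timestamp lines, $2..$4 = the three text lines
  return srt.replace(threeLines, '$1$2 $3 $4\r');
}

var cue = '18\r00:00:50,040 --> 00:00:51,890\r' +
          'All the women gather\rat the hair salon,\rand they just talk';

// JSON.stringify makes the remaining \r characters visible
console.log(JSON.stringify(joinThreeLineCues(cue)));
```

Cues with only two lines of text do not match this variant and are left untouched, so it would have to be run alongside the two-line version.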

Limitations

  • If a cue has more than two lines of text, only the first two are joined; the rest are left as they are.
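
One possible way around this limitation, sketched here as an alternative rather than as part of the regex above, is to split the data into cues on blank lines and join whatever text lines each cue has. The function name joinCueLines is hypothetical, and the sketch assumes that cues are separated by blank lines and that normalizing every line break to \n in the output is acceptable.

```javascript
// Hypothetical helper: joins the text lines of every cue, however many there are.
function joinCueLines(srt) {
  // Normalize \r\n and lone \r to \n so only one newline style has to be handled.
  var normalized = srt.replace(/\r\n?/g, '\n').trim();
  return normalized
    .split(/\n{2,}/) // assumption: cues are separated by one or more blank lines
    .map(function (block) {
      var lines = block.split('\n');
      // Only rewrite blocks whose second line looks like a timestamp line.
      if (lines.length > 2 && lines[1].indexOf('-->') !== -1) {
        return lines[0] + '\n' + lines[1] + '\n' + lines.slice(2).join(' ');
      }
      return block;
    })
    .join('\n\n');
}

var input = '18\r00:00:50,040 --> 00:00:51,890\r' +
            'All the women gather\rat the hair salon,\rand they just talk\r\r' +
            '19\r00:00:52,080 --> 00:00:56,210\r' +
            'all the mothers and daughters\rand they dye their hair orange.\r' +
            'Except for Maria who dyes it pink.';

console.log(joinCueLines(input));
```

Splitting first keeps the index and timestamp lines out of the join entirely, so no fixed line count has to be built into a pattern.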

Example 1

Works fine!

Input:

18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.

Output:

18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.

Example 2

Doesn't fully work: each cue has more than 2 lines, so only the first two are joined.

Input:

18
00:00:50,040 --> 00:00:51,890
All the women gather
at the hair salon,
and they just talk

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters
and they dye their hair orange.
Except for Maria who dyes it pink.

Output:

18
00:00:50,040 --> 00:00:51,890
All the women gather at the hair salon,
and they just talk

19
00:00:52,080 --> 00:00:56,210
all the mothers and daughters and they dye their hair orange.
Except for Maria who dyes it pink.
answered 2012-10-22T21:13:44.637

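This matches the line breaks you want to get rid of, captures the character before and the character after each one, and puts those two back with a space between them: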

var regex = /([a-z,.;:'"])(?:\r\n?|\n)([a-z])/gi;
str = str.replace(regex, '$1 $2');

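A few notes on the regex. I used the i and g modifiers to make it case-insensitive and to find every such line break in the string rather than stopping after the first one. It also assumes that a removable line break can only appear after a letter, comma, period, semicolon, colon, or single or double quote, and before another letter. As @nnnnnn mentioned in a comment above, this won't cover every possible sentence, but at least it shouldn't choke on most punctuation. The line break must be a single line break, but it is platform-independent (it can be \r, \n, or \r\n). I captured the character before the line break and the letter after it (with parentheses) so that I can refer to them as $1 and $2 in the replacement string. That's basically all there is to it.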
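
To illustrate the behaviour described above, here is a small sketch that runs the same regex over the sample from the question. The \r\n line endings in the sample are an assumption made for the demo; the variable names sample and joined are just placeholders.

```javascript
var regex = /([a-z,.;:'"])(?:\r\n?|\n)([a-z])/gi;

// Sample cues joined with Windows-style \r\n line breaks (an assumption for this demo).
var sample = [
  '18',
  '00:00:50,040 --> 00:00:51,890',
  'All the women gather',
  'at the hair salon,',
  '',
  '19',
  '00:00:52,080 --> 00:00:56,210',
  'all the mothers and daughters',
  'and they dye their hair orange.'
].join('\r\n');

// Only breaks that sit between an allowed character and a letter are replaced;
// the index line, the timestamp line and the blank separator are left alone.
var joined = sample.replace(regex, '$1 $2');

console.log(joined);
```

Because digits are not in the first character class, the breaks after the index and timestamp lines survive. For the same reason, a text line ending in a digit or in a character such as ? would not be joined to the line after it.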

answered 2012-10-22T20:57:05.397