regex - multilevel grep

Question

I have a series of HTML files that are formatted like this:

cinema name
 film 1
  showtime 1
  showtime 2
  ...

 film 2
  showtime 1
  showtime 2
  showtime 3
  ...

The name of the cinema is listed only once, at the top; then there is a list of films (any number of films could be here, from 1 to n), and under each film a list of showtimes (again, there could be 1 or more during the day).

I would like to extract this info using grep and output something like:

cinema name - film 1 - showtime 1
cinema name - film 1 - showtime 2
cinema name - film 2 - showtime 1
cinema name - film 2 - showtime 2
cinema name - film 2 - showtime 3
etc.

However, I'm not sure whether/how I can accomplish this with grep. Is it possible? If so, how?

score 1 · Accepted Answer

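You don't have to use a single regex to do everything. In this case, I just figure out which kind of line I have from its leading whitespace, remember the current values for cinema and film, then print them all when I run into a showtime. Although this solution is in Perl, you can do the same thing in the language of your choice: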

#!perl
use v5.10;

my( $cinema, $film );
while( <DATA> ) {
    chomp;
    if( /\A\S/ )            { $cinema = $_ }                 # no indent: cinema name
    elsif( /\A\s(\S.*)/ )   { $film = $1 }                   # one space: film title
    elsif( /\A\s\s(\S.*)/ ) { say "$cinema - $film - $1" }   # two spaces: showtime
    }


__END__
Regal 9
 Jaws
  15:00
  19:00
  21:00

 Star Wars
  16:00
  17:00
  18:00

AMC 18
 E.T.
  12:00
  14:00

 Black Sheep
  22:00
  01:00
  03:00

Here's an ugly Perl one-liner version:

perl -lne '(/\A\S/ and $c=$_) || (/\A\s(\S.*)/ and $f=$1) || (/\A\s\s(\S.*)/ and print"$c - $f - $1")' movies.txt
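
Since the same idea carries over to other languages, here is a rough Python sketch of it. The function name `flatten_showtimes` and the sample data are made up for illustration, and it assumes the indentation is plain spaces (0 for a cinema, 1 for a film, 2 or more for a showtime). Unlike the Perl version, it treats anything indented by two or more spaces as a showtime and strips trailing whitespace.

```python
def flatten_showtimes(lines):
    """Yield 'cinema - film - showtime' for every showtime line.

    Assumes space indentation: 0 = cinema, 1 = film, 2+ = showtime.
    """
    cinema = film = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # blank separator line
        indent = len(line) - len(line.lstrip(" "))
        text = line.strip()
        if indent == 0:
            cinema = text   # remember the current cinema
        elif indent == 1:
            film = text     # remember the current film
        else:
            yield f"{cinema} - {film} - {text}"


# Hypothetical sample in the same shape as the question's data.
sample = """Regal 9
 Jaws
  15:00
  19:00

 Star Wars
  16:00

AMC 18
 E.T.
  12:00
"""

for row in flatten_showtimes(sample.splitlines()):
    print(row)
```

To run it on a real file, passing an open file object (for example `open("movies.txt")`) in place of `sample.splitlines()` should work, since the function only iterates over lines.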

score 0 · Answer

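It's not possible with one expression, but you can do it with five:

Remove the blank lines (simplifies things):

Find: "\n\n"
Replace: "\n"

Fill in the films:

(Finds a showtime that follows a film, optionally preceded by any number of earlier showtimes. The film is captured and then prepended to the showtime.)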

Find: "(?<=\n ([^ \n].+)(\n  .*)*)\n  "
Replace: "\n  $1 - "

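Fill in the cinemas:

(Finds a showtime that follows a cinema, optionally preceded by any number of earlier showtimes or films. The cinema is captured and then prepended to the showtime.)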

Find: "(?<=(?:^|\n)([^ \n].+)(\n {1,2}.*)*)\n  "
Replace: "\n  $1 - "

Remove the non-showtime lines:

Find: "(?<=^|\n)(?!  ).*\n"
Replace: ""

Trim the showtimes:

Find: "\n  "
Replace: "\n"

All of this is untested, and assumes a .NET-like regex syntax with \n line terminators. Adjust to taste.
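
For anyone adjusting this to another flavour: Python's built-in `re` module does not accept variable-length lookbehind, so the Find patterns above won't compile there as written. Below is a rough sketch of the same five passes that swaps the lookbehinds for `re.sub` with a callback. The names `flatten_in_passes` and `_prefix_showtimes` and the sample data are hypothetical, and the patterns are adapted rather than copied from the ones above. In particular, the last pass anchors on line starts instead of a preceding "\n", so that the very first showtime line is trimmed as well.

```python
import re


def _prefix_showtimes(match):
    # group 1: the header line (film or cinema) including its newline
    # group 2: the name on that line
    # group 3: the indented lines that follow it
    header, name, block = match.group(1), match.group(2), match.group(3)
    # Prepend the name to every two-space (showtime) line in the block.
    return header + re.sub(r"(?m)^  (?=\S)", lambda _: f"  {name} - ", block)


def flatten_in_passes(text):
    if not text.endswith("\n"):
        text += "\n"
    # 1. Remove blank lines.
    text = re.sub(r"\n{2,}", "\n", text)
    # 2. Fill in the films: a one-space line followed by two-space lines.
    text = re.sub(r"(?m)^( (\S.*)\n)((?:  \S.*\n?)+)", _prefix_showtimes, text)
    # 3. Fill in the cinemas: an unindented line followed by indented lines.
    text = re.sub(r"(?m)^((\S.*)\n)((?: {1,2}\S.*\n?)+)", _prefix_showtimes, text)
    # 4. Remove the non-showtime lines (anything not starting with two spaces).
    text = re.sub(r"(?m)^(?!  ).*\n", "", text)
    # 5. Trim the leading two spaces from each showtime line.
    text = re.sub(r"(?m)^  ", "", text)
    return text.splitlines()


# Hypothetical sample in the same shape as the question's data.
sample = """Regal 9
 Jaws
  15:00
  19:00

 Star Wars
  16:00

AMC 18
 E.T.
  12:00
"""

for row in flatten_in_passes(sample):
    print(row)
```

The callback does the job the lookbehind did in the original patterns: it captures the film (or cinema) once per block and rewrites only the showtime lines beneath it.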
