I am trying to extract the data between each (nth) occurrence of two patterns.

Pattern 1: CardDetail

Pattern 2: ]

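The input file, input.txt, has thousands of lines, and the content of each line varies. The lines I care about will always contain CardDetail somewhere in the line. Finding the matching lines is easy with awk, but extracting the data between each pair of matches and putting it on its own line is where I fall short.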

input.txt contains data about network devices and any attached/child devices. It looks like this:

DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail [baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]
DeviceDetail [baseProductId=router-100, cardDetail=[CardDetail [baseCardId=router-100NIC1, cardDescription=Router 100 NIC, cardSerial=100NIC1], CardDetail [baseCardId=router-100NIC2, cardDescription=Router 100 NIC, cardSerial=100NIC2]], deviceSerial=100PRIMARY, deviceDescription=Router 100 Base Model]

*Update: I forgot to mention in my original post that I also need the parent device's serial number (deviceSerial) listed.*

I would like output.txt to look like this:

"router-5000NIC1","Router 5000 NIC","5000NIC1","5000PRIMARY"
"router-5000NIC2","Router 5000 NIC","5000NIC2","5000PRIMARY"
"router-100NIC1","Router 100 NIC","100NIC1","100PRIMARY"
"router-100NIC2","Router 100 NIC","100NIC2","100PRIMARY"

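The number of times CardDetail occurs on a single line can vary from 0 to several hundred, depending on the device. I need to be able to extract all of the data, field by field, between each occurrence of CardDetail and the next occurrence of ], and write each card out in CSV format on its own line.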

4 Answers

If you have gawk or mawk available, you can do this by (mis)using the record and field splitting capabilities:

awk -v RS='CardDetail *\\[' -v FS='[=,]' -v OFS=',' -v q='"' '
  NR > 1 { sub("\\].*", ""); print q $2 q, q $4 q, q $6 q }'

Output:

"router-5000NIC1","Router 5000 NIC","5000NIC1"
"router-5000NIC2","Router 5000 NIC","5000NIC2"
"router-100NIC1","Router 100 NIC","100NIC1"
"router-100NIC2","Router 100 NIC","100NIC2"
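
For readers more comfortable outside awk, the same splitting idea can be sketched in Python. This is only an illustration of the technique above, not part of the original answer: the function name `split_cards` and the `SAMPLE` string are made up for the example. It splits the text on `CardDetail [` (the role of RS), cuts each chunk at the first `]`, and then splits on `=` or `,` (the role of FS), keeping every second field.

```python
import re

# One line of the sample input from the question (assumed format).
SAMPLE = (
    "DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail "
    "[baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, "
    "cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, "
    "cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], "
    "deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]"
)


def split_cards(text):
    """Return one quoted CSV row per 'CardDetail [' ... ']' block in text."""
    rows = []
    # Chunk 0 is whatever precedes the first card, so skip it (like NR > 1).
    for chunk in re.split(r"CardDetail *\[", text)[1:]:
        body = chunk.split("]", 1)[0]       # drop the first ']' and the rest
        fields = re.split(r"[=,]", body)    # key, value, key, value, ...
        values = [v.strip() for v in fields[1::2]]
        rows.append(",".join('"%s"' % v for v in values))
    return rows


for row in split_cards(SAMPLE):
    print(row)
```

Like the awk one-liner, this sketch assumes that values never contain `=`, `,` or `]`, and it does not add the parent deviceSerial.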
answered 2013-01-21T21:58:23.357

Is this enough?

$> grep -P -o "(?<=CardDetail).*?(?=\])" input.txt | grep -P -o "(?<=\=).*?(?=\,)"
router-5000NIC1
Router 5000 NIC
router-5000NIC2
Router 5000 NIC
router-100NIC1
Router 100 NIC
router-100NIC2
Router 100 NIC
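
Note that the output above contains no cardSerial values: the second grep requires a comma after each value, and the last field of each card has none. As a rough sketch of the same lookaround idea with that gap closed, here is a Python variant. It is an assumption-laden illustration rather than the answerer's method; the names `CARD`, `VALUE` and `card_values` are invented for the example.

```python
import re

# Text after "CardDetail" up to (not including) the next "]".
CARD = re.compile(r"(?<=CardDetail).*?(?=\])")
# Text after "=" up to the next comma or the end of the card text,
# so the last field (cardSerial) is kept as well.
VALUE = re.compile(r"(?<==)[^,]*")


def card_values(text):
    """Return every field value found inside the CardDetail blocks of text."""
    values = []
    for card in CARD.findall(text):
        values.extend(v.strip() for v in VALUE.findall(card))
    return values


line = (
    "DeviceDetail [baseProductId=router-100, cardDetail=[CardDetail "
    "[baseCardId=router-100NIC1, cardDescription=Router 100 NIC, "
    "cardSerial=100NIC1]], deviceSerial=100PRIMARY]"
)
for value in card_values(line):
    print(value)
```

The values still come out one per line, as in the grep pipeline, so grouping them into CSV rows would be a separate step.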
answered 2013-01-21T21:02:07.497

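Here is an example that uses regular expressions. If there are minor variations in the text format, it will handle them. It also collects all the values in an array; you can then do further processing (sort the values, remove duplicates, etc.) if you wish.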

#!/usr/bin/awk -f

BEGIN {
    i_result = 0
    DQUOTE = "\""
}

{
    line = $0
    for (;;)
    {
        i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
        if (0 == i)
            break
        # a[1] has the text from the parentheses
        s = a[1]
        # replace from this: a, b, c   to this:  "a","b","c"
        gsub(/ *, */, "\",\"", s)
        s = DQUOTE s DQUOTE

        results[i_result++] = s
        line = substr(line, RSTART + RLENGTH)
    }
}

END {
    for (i = 0; i < i_result; ++i)
        print results[i]
}

P.S. Just for fun, I made a Python version.

#!/usr/bin/python

import re
import sys

DQUOTE = "\""

pat_card = re.compile(r"CardDetail \[ *([^]]*) *\]")
pat_comma = re.compile(" *, *")

results = []

def collect_cards(line, results):
    while True:
        m = re.search(pat_card, line)
        if not m:
            return
        s = m.group(1)
        s = DQUOTE + re.sub(pat_comma, '","', s) + DQUOTE
        results.append(s)
        # continue searching just past the end of this match
        line = line[m.end():]

if __name__ == "__main__":
    for line in sys.stdin:
        collect_cards(line, results)

    for card in results:
        print(card)

EDIT: Here is a new version that also looks for "deviceID" and puts the matched text as the first field.

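In AWK, you concatenate strings simply by placing them next to each other in an expression; there is an implicit concatenation operator when two strings are side by side. So this puts the deviceID text into a variable called s0, using concatenation to wrap it in double quotes; later, concatenation is used again to put s0 at the start of the matched string.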

#!/usr/bin/awk -f

BEGIN {
    i_result = 0
    DQUOTE = "\""
    COMMA = ","
}

{
    line = $0
    for (;;)
    {
        i = match(line, /deviceID=([A-Za-z_0-9]*),/, a)
        s0 = DQUOTE a[1] DQUOTE
        i = match(line, /CardDetail \[ *([^]]*) *\]/, a)
        if (0 == i)
            break
        # a[1] has the text from the parentheses
        s = a[1]
        # strip the key= prefixes, from this: foo=a, bar=b, other=c   to this:  a, b, c
        gsub(/[A-Za-z_][^=,]*=/, "", s)
        # replace from this: a, b, c   to this:  "a","b","c"
        gsub(/ *, */, "\",\"", s)
        s = s0 COMMA DQUOTE s DQUOTE

        results[i_result++] = s
        line = substr(line, RSTART + RLENGTH)
    }
}

END {
    for (i = 0; i < i_result; ++i)
        print results[i]
}
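
The sample input in the question names the parent field deviceSerial rather than deviceID, and the requested output lists it last. A minimal Python sketch under those assumptions follows; it is an illustration, not the code from this answer, and the names `card_rows`, `CARD`, `SERIAL` and `KEY` are hypothetical.

```python
import re

CARD = re.compile(r"CardDetail \[ *([^]]*) *\]")
SERIAL = re.compile(r"deviceSerial=([^,\]]*)")
KEY = re.compile(r"[A-Za-z_][^=,]*=")   # same key= pattern as the awk gsub


def card_rows(line):
    """Return one CSV row per card, with the parent deviceSerial appended."""
    m = SERIAL.search(line)
    serial = m.group(1).strip() if m else ""
    rows = []
    for body in CARD.findall(line):
        # Drop each "key=" prefix and surrounding spaces, keeping the value.
        values = [KEY.sub("", part).strip() for part in body.split(",")]
        rows.append(",".join('"%s"' % v for v in values + [serial]))
    return rows


line = (
    "DeviceDetail [baseProductId=router-5000, cardDetail=[CardDetail "
    "[baseCardId=router-5000NIC1, cardDescription=Router 5000 NIC, "
    "cardSerial=5000NIC1], CardDetail [baseCardId=router-5000NIC2, "
    "cardDescription=Router 5000 NIC, cardSerial=5000NIC2]], "
    "deviceSerial=5000PRIMARY, deviceDescription=Router 5000 Base Model]"
)
for row in card_rows(line):
    print(row)
```

Because the serial is looked up once on the whole line before the cards are processed, it does not matter whether deviceSerial appears before or after the card list.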
answered 2013-01-21T22:03:48.867

Try this:

 #awk -f myawk.sh temp.txt
 BEGIN { RS="CardDetail"; FS="[=,]"; OFS=","; print "Begin Processing "}
 $0 ~ /baseCardId/ {gsub("]","",$0);print $2, $4 , $6}
 END {print "Process Complete"}
answered 2013-01-22T02:46:36.483