I have a small problem and don't know where to start. I have a text file containing the following information.

MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. £5,995 ono Telephone xxxxx xxxxx

I need to fill the information above into the following format:

    <advert>
    <manufacturer></manufacturer>
    <make></make>
    <model></model>
    <price></price>
    <miles></miles>
    <image></image>
    <desc><![CDATA[desc]]></desc>
    <expiry></expiry> // Any point in the future
    <url></url> // Optional
</advert>
<advert>

The output should be:

    </advert>
<advert>
    <manufacturer>MINI</manufacturer>
    <make></make>
    <model></model>
    <price>5,995</price>
    <miles>30000</miles>
    <image></image>
    <desc><![CDATA[2007, British Racing Green, full service history, metallic paint, alloys. Great condition.Telephone xxxxxx xxxxxx]]></desc>
    <expiry>Todays date 13/05/2013</expiry>
    <url></url>
</advert>

Any help would be greatly appreciated.

2 Answers

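Since commas are sometimes part of a field and sometimes not, you cannot use a comma (or anything else) as the field separator, so you need something like this in GNU awk (required for gensub() and strftime()):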

gawk '{
    print "<advert>"
    printf "\t<manufacturer>%s</manufacturer>\n", $1
    printf "\t<make></make>\n"
    printf "\t<model></model>\n"
    printf "\t<price>%s</price>\n", gensub(/.*£([[:digit:],]+).*/,"\\1","")
    printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1","")
    printf "\t<image></image>\n"
    printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1","")
    printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
    printf "\t<url></url>\n"
    print "</advert>"
}' file

My editor seems to choke on the pound sign, so here is the above script run with a # sign in its place:

$ cat file
MINI COOPER 2007, 30,000 miles, British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx

$ gawk '{
    print "<advert>"
    printf "\t<manufacturer>%s</manufacturer>\n", $1
    printf "\t<make></make>\n"
    printf "\t<model></model>\n"
    printf "\t<price>%s</price>\n", gensub(/.*#([[:digit:],]+).*/,"\\1","")
    printf "\t<miles>%s</miles>\n", gensub(/.*[[:space:]]([[:digit:],]+)[[:space:]]+miles.*/,"\\1","")
    printf "\t<image></image>\n"
    printf "\t<desc><![CDATA[%s]]></desc>\n", gensub(/.*[[:space:]]+miles[[:space:]]*,[[:space:]]*(.*)/,"\\1","")
    printf "\t<expiry>Todays date %s</expiry>\n", strftime("%d/%m/%Y")
    printf "\t<url></url>\n"
    print "</advert>"
}' file
<advert>
        <manufacturer>MINI</manufacturer>
        <make></make>
        <model></model>
        <price>5,995</price>
        <miles>30,000</miles>
        <image></image>
        <desc><![CDATA[British Racing Green, full service history, metallic paint, alloys. Great condition. #5,995 ono Telephone xxxxx xxxxx]]></desc>
        <expiry>Todays date 13/05/2013</expiry>
        <url></url>
</advert>
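The output above still differs from what the question asked for in two places: the mileage keeps its thousands separator (30,000 rather than 30000), and the description still contains the price. Below is a minimal sketch of how those two values could be cleaned up. It assumes the price is marked with # as in the transcript above and is optionally followed by "ono". The function name normalize_ad and the miles|desc output format are made up for illustration, and only POSIX awk features are used. It does not try to move the year into the description.

```shell
# normalize_ad: hypothetical helper, not part of the answer above.
# Reads one advert per line on stdin and prints "miles|desc".
normalize_ad() {
    awk '{
        line = $0
        miles = ""
        # Mileage: digits and commas followed by the word "miles".
        if (match(line, /[0-9][0-9,]* +miles/)) {
            miles = substr(line, RSTART, RLENGTH)
            sub(/ +miles$/, "", miles)   # drop the unit
            gsub(/,/, "", miles)         # 30,000 becomes 30000
        }
        # Description: everything after "miles,", minus the "#<price> ono" part.
        desc = line
        sub(/^.*miles, */, "", desc)
        sub(/#[0-9,]+( ono)? */, "", desc)
        print miles "|" desc
    }'
}
```

It can be tried with `normalize_ad < file`; the two values could then be dropped into the miles and desc elements of the printf calls shown earlier.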
answered 2013-05-13T12:16:47.407

Here is some sample code that should at least get you started. Run it like this:

awk -f script.awk file.txt

Contents of script.awk:

{
    # Scan every whitespace-separated field for the mileage and the price.
    for (i=1;i<=NF;i++) {

        # The field before "miles," holds the mileage, e.g. "30,000".
        if ($i == "miles,") {
            miles = $(i - 1)

            # Blank both fields so they drop out of the description.
            $i = $(i - 1) = ""
        }

        # A field containing the pound sign holds the price, e.g. "£5,995".
        if ($i ~ /£/) {
            # Skip the leading pound sign.
            price = substr($i, 2)

            # Blank the price and the word after it ("ono").
            $i = $(i + 1) = ""
        }
    }

    # Collapse the runs of spaces left behind by the blanked fields.
    gsub(/ +/, " ");

    print "<advert>"
    print "\t<manufacturer>" $1 "</manufacturer>"
    print "\t<make></make>"
    print "\t<model></model>"
    print "\t<price>" price "</price>"
    print "\t<miles>" miles "</miles>"
    print "\t<image></image>"
    print "\t<desc><![CDATA[" $0 "]]></desc>"
    print "\t<expiry>" strftime( "%d/%m/%Y" ) "</expiry>"
    print "\t<url></url>"
    print "</advert>"
}

Result:

<advert>
    <manufacturer>MINI</manufacturer>
    <make></make>
    <model></model>
    <price>5,995</price>
    <miles>30,000</miles>
    <image></image>
    <desc><![CDATA[MINI COOPER 2007, British Racing Green, full service history, metallic paint, alloys. Great condition. Telephone xxxxx xxxxx]]></desc>
    <expiry>13/05/2013</expiry>
    <url></url>
</advert>
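One caveat: strftime() is not specified by POSIX awk, so not every awk provides it, and `awk -f script.awk` may fail where awk is not gawk. Below is a minimal sketch of one workaround, assuming a POSIX shell with the `date` utility. The date is computed by the shell and handed to awk with -v. The helper name expiry_tag and the variable name today are made up for illustration.

```shell
# expiry_tag: hypothetical helper; prints an <expiry> element using a date
# computed by the shell, so the awk program needs no strftime().
expiry_tag() {
    awk -v today="$(date +%d/%m/%Y)" 'BEGIN {
        # today arrives as a plain string such as 13/05/2013.
        printf "\t<expiry>%s</expiry>\n", today
    }'
}
```

The same -v assignment could be added to the `awk -f script.awk file.txt` call, with `today` replacing the strftime() call inside the script.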
answered 2013-05-13T12:18:54.697