sorting - Remove duplicate lines based on a specific string

Question

How can I remove duplicate lines based on a specific string or character?

For example, I have a file with the following contents:

https://example.com/?first=one&second=two&third=three
https://example.com/?first=only&second=cureabout&third=theparam
https://example.com/?fourth=four&fifth=five
https://stack.com/?sixth=six&seventh=seven&eighth=eight
https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something

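I want to make the lines unique based on their query-string parameters and print only one URL for each set of identical parameters, while also taking the domain into account. The values do not matter.

Desired result: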

https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight

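Update

In the code below, I tried to grep the 3 characters before each `=` and, if a line contains that particular string, keep only unique lines and print the result. In fact, the goal is to make the file unique when it contains a certain number of similar parameters.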

for url in $(cat $1); do

    # COUNT NUMBER OF EQUAL CHARACTER "="
    count_eq=$(echo $url | sed "s/=/=\n/g" | grep -a -c '=')
    if [[ $count_eq == "3" ]]; then

        # GREP 3 CHARACTERS BEFORE "="
        same_param=$(printf $url | grep -o -P '.{0,3}=.{0,0}' | sort -u)
    
        if [[ $url == *"$same_param"* ]];then
            sort -u "$url" | printf "$url\n"
        fi
    fi

done

Thanks.

score 1 · Accepted Answer

You can try the following code:

awk '!a[$0]++' file

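It checks whether the line is already present in the array `a` and prints it only if it is not, that is, the first time the line is seen.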
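For comparison, the same first-occurrence idiom can be sketched in Python. The function name `unique_lines` and the sample data are made up for this illustration. Like the awk one-liner, it only drops lines that are identical character for character, so on its own it does not group URLs by their parameters.

```python
def unique_lines(lines):
    """Yield each line the first time it appears, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:   # same test as !a[$0] in awk
            seen.add(line)     # same bookkeeping as a[$0]++
            yield line


# Hypothetical sample data: only the exact repeat of "a=1" is dropped.
sample = ["a=1", "b=2", "a=1", "a=2"]
print(list(unique_lines(sample)))
```

Note that "a=1" and "a=2" are both kept, because the comparison is on the whole line.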

score 0 · Accepted Answer

A two-step approach is probably the easiest to understand.

First, print the values of `first` and `second` together with the URL:

< a.txt awk -F'[?&]' '{for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]};
                      print $1" "p["first"]" "p["second"]}'
https://example.com/ one two
https://example.com/ one two
https://example.com/ one two
https://stack.com/ one two
https://stack.com/ one two

Now replace the final print statement with a pattern, which turns the script into a filter:

< a.txt awk -F'[?&]' '{for(i=2;i<=NF;i++){split($i,a,"=");p[a[1]]=a[2]}}
                      !seen[$1""p["first"]""p["second"]]++'
https://example.com/?first=one&second=two&third=three
https://stack.com/?sixth=six&seventh=seven&eighth=eight
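The idea behind this filter (build a key from the part of the URL before `?` plus the values of selected parameters, then keep only the first line per key) can also be sketched in Python. This is a minimal illustration and not part of the answer's awk script: the function name `dedupe_by_values` and the sample URLs are hypothetical. One difference is that the awk array `p` is never cleared between lines, whereas here the parameters are parsed afresh for every URL, so a URL without `first` or `second` does not inherit values from an earlier line.

```python
from urllib.parse import urlsplit, parse_qs


def dedupe_by_values(urls, names=("first", "second")):
    """Keep the first URL for each (scheme, host, path, selected values) key."""
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        params = parse_qs(parts.query)  # rebuilt for every URL
        # A parameter that is absent contributes None to the key.
        values = tuple(params.get(name, [None])[0] for name in names)
        key = (parts.scheme, parts.netloc, parts.path, values)
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# Hypothetical input: the second URL repeats first/second for the same host.
urls = [
    "https://example.com/?first=one&second=two&third=three",
    "https://example.com/?first=one&second=two&third=other",
    "https://stack.com/?first=one&second=two",
]
for u in dedupe_by_values(urls):
    print(u)
```

With this input the first and third URLs are printed. The second is dropped because it shares host, path, `first` and `second` with the first one.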

In the comments, you asked for a general solution that takes every parameter into account, not just `first` and `second`.

I would use Python for this:

#!/usr/bin/python3

# test.py

import sys
from urllib.parse import urlparse, parse_qsl

seen = {}
for line in sys.stdin:
    url = urlparse(line.strip())
    # create a search lookup of sorted parameters, scheme and domain
    sorted_params = sorted(parse_qsl(url.query), key=lambda x:x[0])
    check_str = '{}://{}?{}'.format(
        url.scheme,
        url.netloc,
        '&'.join(['='.join(p) for p in sorted_params]),
    )
    # check if this combination of parameters and values has been seen before
    if check_str not in seen:
        seen[check_str] = 1
        print(line.strip())

Run it like this:

< input.file python3 test.py
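The script above builds its key from both parameter names and values. Since the question says the values do not matter, a variant that keys on the parameter names alone may be closer to what is wanted. The following is a sketch under assumptions, not a drop-in replacement: the function name `dedupe_by_param_names` is made up, and the optional `param_count` filter is a guess at the intent behind the `count_eq == "3"` check in the question.

```python
from urllib.parse import urlsplit, parse_qsl


def dedupe_by_param_names(urls, param_count=None):
    """Keep the first URL for each (scheme, host, set of parameter names).

    If param_count is given, URLs with a different number of parameters
    are skipped (assumption based on the count check in the question).
    """
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        # keep_blank_values=True so that "a=" still counts as parameter "a"
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        if param_count is not None and len(pairs) != param_count:
            continue
        key = (parts.scheme, parts.netloc, frozenset(name for name, _ in pairs))
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept


# The sample input from the question.
urls = [
    "https://example.com/?first=one&second=two&third=three",
    "https://example.com/?first=only&second=cureabout&third=theparam",
    "https://example.com/?fourth=four&fifth=five",
    "https://stack.com/?sixth=six&seventh=seven&eighth=eight",
    "https://stack.com/?sixth=itdoesnt&seventh=matter&eighth=something",
]
for u in dedupe_by_param_names(urls, param_count=3):
    print(u)
```

With `param_count=3` this prints the two URLs listed as the desired result in the question. Without it, the `fourth=four&fifth=five` URL is kept as well, because no earlier URL on that host has the same parameter names.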
