regex - How do I remove duplicate domains from a large list of URLs? Regex or otherwise

Question

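I originally asked this question: Regular Expression in gVim to Remove Duplicate Domains from a List

However, I realized that I am more likely to find a workable solution if I "broaden my scope" in terms of what solutions I am willing to accept.

So I will rephrase my question, and maybe I will get a better solution. Here goes:

I have a large list of URLs in a .txt file (I am running Windows Vista 32-bit), and I need to remove the duplicate domains (along with the entire corresponding URL of each duplicate) while leaving the first occurrence of each domain. There are roughly 6,000,000 URLs in this particular file, in the following format (the URLs obviously don't have spaces in them; I had to do that because I don't have enough posts here to post that many "live" URLs):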

http://www.exampleurl.com/something.php
http://exampleurl.com/somethingelse.htm  
http://exampleurl2.com/another-url  
http://www.exampleurl2.com/a-url.htm  
http://exampleurl2.com/yet-another-url.html  
http://exampleurl.com/  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

Whatever the solution is, the output file, using the above as input, should look like this:

http://www.exampleurl.com/something.php  
http://exampleurl2.com/another-url  
http://www.exampleurl3.com/here_is_a_url  
http://www.exampleurl5.com/something

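Notice that there are no duplicate domains now, and that the first occurrence of each one was kept.

If anyone can help me, whether with regular expressions or with some program I am not aware of, that would be great.

I will say this, though: I have no experience with anything other than Windows operating systems, so a solution that needs something other than a Windows program would take a bit of "baby stepping", so to speak (if anyone is kind enough to do so).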

score 2 · Accepted Answer

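A regular expression in Python; very crude, and it does not work with subdomains. The basic idea is to use a dictionary keyed by domain name: a URL is stored only if its key is not already present, so the first occurrence of each domain is kept.

import re

# Crude pattern: optional "www." prefix, then the domain label, a dot and the TLD.
pattern = re.compile(r'https?://(?:www\.)?([\w-]+)\.(\w+)')
urlsDict = {}

with open("urlsin.txt", "r") as urlsFile:
    for linein in urlsFile:  # iterate lazily instead of calling readlines()
        match = pattern.search(linein)
        if match is None:
            continue  # skip lines that do not look like URLs
        domain = match.group(1)
        # Keep only the first URL seen for each domain.
        if domain not in urlsDict:
            urlsDict[domain] = linein

# Dicts preserve insertion order (Python 3.7+), so the input order is kept.
with open("outurls.txt", "w") as outFile:
    outFile.write("".join(urlsDict.values()))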

You can extend this to filter out subdomains, but I think the basic idea is there. And with 6 million URLs, it may take quite a while in Python...
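One way the subdomain filtering mentioned above might look is sketched here. It swaps the regex for `urllib.parse.urlparse` and assumes the domain is simply the last two labels of the host name, which is wrong for suffixes such as co.uk. The helper names `registered_domain` and `first_per_domain` are made up for illustration.

```python
from urllib.parse import urlparse


def registered_domain(url):
    """Naive guess at the domain: the last two labels of the host name.

    Assumption: this ignores multi-part suffixes such as co.uk.
    """
    host = urlparse(url.strip()).hostname
    if host is None:
        return None  # not an absolute URL
    labels = host.lower().split('.')
    return '.'.join(labels[-2:])


def first_per_domain(urls):
    """Return the first URL seen for each domain, in input order."""
    seen = set()
    kept = []
    for url in urls:
        domain = registered_domain(url)
        if domain is not None and domain not in seen:
            seen.add(domain)
            kept.append(url)
    return kept
```

With this, www.exampleurl.com, blog.exampleurl.com and exampleurl.com would all collapse to exampleurl.com. Handling real-world suffixes properly would need a public-suffix list, which this sketch does not attempt.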

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski, in comp.emacs.xemacs)

score 1 · Accepted Answer

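For this particular case I would not use a regex. URLs are a well-defined format, and there is an easy-to-use parser for that format in the BCL: the Uri type. It can be used to easily parse the URL and get the domain information you are looking for.

Here is a quick example: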

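// Requires: using System; using System.Collections.Generic; using System.IO;
public List<string> GetUrlWithUniqueDomain(string file) {
  var list = new List<string>();      // declared outside the using block so it can be returned
  var found = new HashSet<string>();  // hosts already seen
  using ( var reader = new StreamReader(file) ) {
    var line = reader.ReadLine();
    while (line != null) {
      Uri uri;
      // HashSet.Add returns false for a host already seen, so only the first URL per host is kept.
      // Note: uri.Host keeps any "www." prefix, so www.example.com and example.com count as different hosts.
      if ( Uri.TryCreate(line, UriKind.Absolute, out uri) && found.Add(uri.Host)) {
        list.Add(line);
      }
      line = reader.ReadLine();
    }
  }
  return list;
}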

score 1 · Accepted Answer

I would use a combination of Perl and regular expressions. My first version:

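   use warnings ;
   use strict ;
   my %seen ;    # domains already encountered
   while (<>) {
     if ( m{ // ( .*? ) / }x ) {
       my $dom = $1 ;
       # print the line only the first time its domain is seen
       print unless $seen{$dom}++ ;
     } else {
       warn "Unrecognised line: $_" ;    # report on STDERR, not in the output
     }
   }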

But this treats www.exampleurl.com and exampleurl.com as different. My second version has

if ( m{ // (?:www\.)? ( .*? ) / }x )

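which ignores a "www." at the front. You can probably refine the regex a bit, but that is left to the reader.

Finally, you could comment the regex a little (the /x modifier allows this). It depends on who is going to read it; it could be regarded as too verbose.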

           if ( m{
               //          # match double slash
               (?:www\.)?  # ignore www
               (           # start capture
                  .*?      # anything but not greedy
                )          # end capture
                /          # match /
               }x ) {

I use m{} rather than // to avoid /\/\/
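For readers more comfortable with Python, the commented regex above has a rough counterpart in `re.VERBOSE`, which plays the role of Perl's /x. This is only a sketch: the names `DOMAIN_RE` and `domain_of` are invented here, and the capture uses `[^/\s]*` rather than `.*?` followed by a slash, so that URLs without a trailing slash are also matched.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
DOMAIN_RE = re.compile(r"""
    //           # match the double slash after the scheme
    (?:www\.)?   # ignore an optional leading www.
    ([^/\s]*)    # capture the host: everything up to the next slash or whitespace
    """, re.VERBOSE)


def domain_of(line):
    """Return the captured host, or None if the line has no '//'."""
    match = DOMAIN_RE.search(line)
    return match.group(1) if match else None
```

A loop over the input lines could then keep a set of the values returned by `domain_of` and print a line only when its value has not been seen before, mirroring the `%seen` hash in the Perl version.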

score 0 · Accepted Answer

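If you don't have one, find a unix box, or get cygwin.
Use tr to convert '.' to TAB for convenience.
Use sort(1) to sort the lines by the domain-name part. This might be made a little easier by writing an awk program to normalize the www part.

And ça va, you have the duplicates together. Then perhaps use uniq(1) to find the duplicates.

(Extra credit: why can't a regular expression alone do this? Computer science students should think about the pumping lemma.)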
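The sort-then-uniq idea can also be sketched in Python for those without a unix box at hand. This is an illustration rather than a translation of the tr/sort/uniq commands: the function names are hypothetical, `urlparse` stands in for the awk normalization step, and the original line position is carried along so the first occurrence of each domain can be recovered after sorting.

```python
from itertools import groupby
from urllib.parse import urlparse


def domain_key(url):
    """Host name with a leading 'www.' removed ('' if there is no host)."""
    host = urlparse(url.strip()).hostname or ''
    return host[4:] if host.startswith('www.') else host


def dedupe_by_sorting(urls):
    # Decorate each URL with its domain and its original position.
    decorated = [(domain_key(u), i, u) for i, u in enumerate(urls)]
    # Sorting brings equal domains together, earliest position first.
    decorated.sort()
    # Like uniq: keep only the first entry of each run of equal domains.
    firsts = [next(group) for _, group in groupby(decorated, key=lambda t: t[0])]
    # Restore the original input order.
    firsts.sort(key=lambda t: t[1])
    return [u for _, _, u in firsts]
```

Note that lines without a recognisable host all share the empty key in this sketch, so only the first of them would survive.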

score 0 · Accepted Answer

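This can be done with the following code. It extracts the first URL for each unique domain from a text file. Host names already seen are kept in a set, so each URL is checked against earlier ones by exact host name rather than by scanning every kept URL.

from urllib.parse import urlparse

seen_hosts = set()   # host names already encountered
unique_urls = []     # first URL seen for each host name, in input order
total = 0

with open('all-urls.txt', encoding='utf-8', errors='ignore') as all_urls:
    for url in all_urls:
        url = url.strip()
        if not url:
            continue  # skip blank lines
        total += 1
        root_url = urlparse(url).hostname
        if root_url is None:
            continue  # skip lines that are not absolute URLs
        if root_url.startswith('www.'):
            root_url = root_url[4:]  # treat www.example.com and example.com as one domain
        if root_url not in seen_hosts:
            seen_hosts.add(root_url)
            unique_urls.append(url)

print('all urls count = ', total)

with open('unique-urls.txt', 'w', encoding='utf8') as unique_urls_file:
    for unique_url in unique_urls:
        unique_urls_file.write(unique_url + '\n')

print('all unique urls count = ', len(unique_urls))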
