8

I've been trying to remove all utm_* parameters from a list of URLs. The closest thing I've found is this: https://gist.github.com/626834.

Any ideas?

5 Answers

8

It's a bit long, but it uses the url* modules and avoids regular expressions.

# Python 3: these helpers live in urllib.parse (urllib and urlparse in Python 2)
from urllib.parse import urlencode, urlparse, parse_qs, urlunparse

url = 'http://whatever.com/somepage?utm_one=3&something=4&utm_two=5&utm_blank&something_else'

parsed = urlparse(url)
# keep_blank_values=True keeps keys without a value, such as 'something_else'
qd = parse_qs(parsed.query, keep_blank_values=True)
filtered = {k: v for k, v in qd.items() if not k.startswith('utm_')}
newurl = urlunparse([
    parsed.scheme,
    parsed.netloc,
    parsed.path,
    parsed.params,
    urlencode(filtered, doseq=True),  # query string
    parsed.fragment
])

print(newurl)
# http://whatever.com/somepage?something=4&something_else=
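Since the question is about a list of URLs, the same idea can be wrapped in a function and applied to each entry. What follows is only a sketch: the name `strip_utm` and the sample URLs are made up for illustration, and it assumes Python 3. It swaps `parse_qs` for `parse_qsl` so that parameter order and repeated keys survive.

```python
from urllib.parse import parse_qsl, urlencode, urlparse, urlunparse


def strip_utm(url):
    """Return url without query parameters whose name starts with 'utm_'."""
    parsed = urlparse(url)
    # parse_qsl yields (key, value) pairs in their original order.
    kept = [(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
            if not k.startswith('utm_')]
    return urlunparse(parsed._replace(query=urlencode(kept)))


# Hypothetical sample data.
urls = [
    'http://whatever.com/somepage?utm_one=3&something=4&utm_two=5',
    'http://whatever.com/other?a=1&a=2&utm_source=x#frag',
    'http://whatever.com/plain',
]
cleaned = [strip_utm(u) for u in urls]
for u in cleaned:
    print(u)
```

Note that the values are decoded and then re-encoded by `urlencode`, so the escaping of the remaining parameters may come out normalised rather than byte-for-byte identical.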
answered 2012-07-24T23:03:47.143
1
import re
from urllib.parse import urlparse, urlunparse  # Python 2: from urlparse import ...

url = 'http://www.someurl.com/page.html?foo=bar&utm_medium=qux&baz=qoo'
parsed_url = list(urlparse(url))
# Index 4 of the parsed tuple is the query string; drop every utm_* pair.
parsed_url[4] = '&'.join(
    [x for x in parsed_url[4].split('&') if not re.match(r'utm_', x)])
utmless_url = urlunparse(parsed_url)

print(utmless_url)  # http://www.someurl.com/page.html?foo=bar&baz=qoo
answered 2012-07-24T23:00:08.930
1

Simple, it works, and it's based on the link you posted, but it's re... so I'm not sure it won't break for some reason I can't think of :)

import re

def trim_utm(url):
    if "utm_" not in url:
        return url
    # Groups: everything up to '?', the query string, then the '#fragment'.
    matches = re.findall(r'(.+\?)([^#]*)(.*)', url)
    if len(matches) == 0:
        return url
    match = matches[0]
    query = match[1]
    sanitized_query = '&'.join([p for p in query.split('&') if not p.startswith('utm_')])
    return match[0]+sanitized_query+match[2]

if __name__ == "__main__":
    tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
                "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
                "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
                "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
                "http://localhost/index.php?utm_a=a",
                "http://localhost/index.php?a=utm_a",
                "http://localhost/index.php?a=1&b=2",
                "http://localhost/index.php",
                "http://localhost/index.php#hash2"
            ]

    for t in tests:
        trimmed = trim_utm(t)
        print(t)
        print(trimmed)
        print()
answered 2012-07-24T23:12:11.883
1

How about this? Nice and simple:

url = 'https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763?utm_source=feedburner&utm_medium=feed&utm_campaign=feed-main'

print(url[:url.find('?utm')])

https://searchengineland.com/amazon-q3-ad-revenues-surpass-1-billion-roughly-2x-early-2016-285763
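This shortcut works for the URL shown because the utm_ parameters are the only thing in the query string. A slightly more defensive sketch of the same slicing idea is given here, with a hypothetical helper name `drop_utm_query` and made-up `example.com` URLs. It guards against `str.find` returning -1, in which case the bare slice would cut off the last character of the URL.

```python
def drop_utm_query(url):
    """Cut the URL at '?utm_' if present; otherwise return it unchanged."""
    cut = url.find('?utm_')
    # str.find returns -1 when there is no match; slicing with -1 would
    # silently drop the last character, so handle that case explicitly.
    return url if cut == -1 else url[:cut]


print(drop_utm_query('https://example.com/post?utm_source=feed&utm_medium=feed'))
print(drop_utm_query('https://example.com/post'))
# Limitation: anything after the first utm_ parameter is dropped as well,
# including unrelated parameters such as page=2 here.
print(drop_utm_query('https://example.com/post?utm_source=feed&page=2'))
```

It also does nothing when a utm_ parameter is not the first one in the query, so it only suits inputs known to have that shape.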
answered 2017-10-29T09:03:46.960
0

Using a regular expression

import re
def clean_url(url):
    return re.sub(r'(?<=[?&])utm_[^&]+&?', '', url)

What's going on here? We use a regular expression to find every substring that looks like utm_somekey=somevalue and is preceded by "?" or "&".

Test it:

tests = [   "http://localhost/index.php?a=1&utm_source=1&b=2",
            "http://localhost/index.php?a=1&utm_source=1&b=2#hash",
            "http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
            "http://localhost/index.php?a=1&utm_source=1&utm_a=yes&b=2#hash",
            "http://localhost/index.php?utm_a=a",
            "http://localhost/index.php?a=utm_a",
            "http://localhost/index.php?a=1&b=2",
            "http://localhost/index.php",
            "http://localhost/index.php#hash2"
        ]

for t in tests:
    print(clean_url(t))

http://localhost/index.php?a=1&b=2
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?a=1&b=2&
http://localhost/index.php?a=1&b=2#hash
http://localhost/index.php?
http://localhost/index.php?a=utm_a
http://localhost/index.php?a=1&b=2
http://localhost/index.php
http://localhost/index.php#hash2
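The output shows two rough edges: a trailing "?" or "&" can be left behind, and in the third test the #hash fragment disappears because `[^&]+` runs to the end of the string. One possible refinement, offered as a sketch rather than a drop-in fix, stops the match at "#" and then tidies up leftover separators. The name `clean_url_strict` and both pattern names are hypothetical.

```python
import re

# A utm_ parameter preceded by ? or &, stopping at & or # (value optional).
_UTM = re.compile(r'(?<=[?&])utm_[^&#]*&?')
# A leftover ? or & sitting directly before the fragment or the end.
_DANGLING = re.compile(r'[?&](?=#|$)')


def clean_url_strict(url):
    """Remove utm_* parameters, keeping the fragment and tidying separators."""
    return _DANGLING.sub('', _UTM.sub('', url))


for t in ["http://localhost/index.php?a=1&utm_source=1&b=2&utm_something=no#hash",
          "http://localhost/index.php?utm_a=a"]:
    print(clean_url_strict(t))
# http://localhost/index.php?a=1&b=2#hash
# http://localhost/index.php
```

This still treats the URL as a flat string, so a "?" or "&" inside the fragment is not handled specially, and a URL that deliberately ends in "?" loses it.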
answered 2021-02-15T21:49:48.237