python - Python - parse IPv4 addresses from string (even when censored)

Question

Objective: Write Python 2.7 code to extract IPv4 addresses from a string.

String content example:

The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

As you can see from the above, I am struggling to find a way to parse through a txt file that may contain IPs depicted in multiple forms of "censorship" (to prevent hyper-linking).

I'm thinking that a regex is the way to go. Maybe something along the lines of: any grouping of four ints 0-255 or 000-255 separated by anything in a 'separators list', which would consist of periods, brackets, parentheses, or any of the other aforementioned examples. This way, the 'separators list' could be updated as needed.

I'm not sure if this is the proper way to go, or even possible, so any help with this is greatly appreciated.

Update: Thanks to recursive's answer below, I now have the following code working for the above example. It will...

find the IPs
place them into a list
clean them of the spaces/braces/etc
and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for incorrect/invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it will drop the trailing 6 and 3 from the aforementioned. If the first octet is invalid (e.g. 256.10.10.10), it will drop the leading 2 (resulting in 56.10.10.10).

import re

def extractIPs(fileContent):
    # Four octets (0-255, leading zeros allowed) joined by an optionally wrapped "." or "dot".
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = []
    for each in re.findall(pattern, fileContent):
        # each[0] is the whole match; strip spaces/brackets, then turn "dot" into ".".
        ip = re.sub(r"[ ()\[\]]", "", each[0])
        ips.append(ip.replace("dot", "."))
    return ips

# The with-block closes the file once it has been read.
with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)
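The caveat above (trailing or leading digits being dropped from invalid addresses) could be addressed with lookarounds. The following is a minimal sketch, not part of the original code; the names `OCTET`, `SEP`, `STRICT_IP`, and `extract_strict` are made up for illustration. It only guards against plain digits and plain dots next to a match, so an extra octet attached with a censored dot would still slip through.

```python
import re

# Same octet and separator pieces as the pattern above, written as
# non-capturing groups (illustrative names, not from the original post).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"
SEP = r"[ (\[]?(?:\.|dot)[ )\]]?"

STRICT_IP = re.compile(
    r"(?<![\d.])"                       # not preceded by a digit or a dot
    + OCTET + r"(?:" + SEP + OCTET + r"){3}"
    + r"(?!\d|\.\d)"                    # not followed by a digit or ".digit"
)

def extract_strict(text):
    """Return IP-like matches, skipping ones glued to extra digits or octets."""
    return [m.group(0) for m in STRICT_IP.finditer(text)]

print(extract_strict("bad: 192.168.0.256, 192.168.1.2.3, 256.10.10.10"))
print(extract_strict("good: 192.168.1[.]1 or 8.8.8.8."))
```

With this approach, invalid candidates are rejected outright instead of being silently shortened; the matches still need the same cleanup step to remove brackets and "dot".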

score 9 · Accepted Answer

Here's a regex that works:

import re
pattern = r"((([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])[ (\[]?(\.|dot)[ )\]]?){3}([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5]))"
text = "The following are IP addresses: 192.168.1.1, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. "
ips = [match[0] for match in re.findall(pattern, text)]
print ips

# output: ['192.168.1.1', '8.8.8.8', '101.099.098.000', '192.168.1[.]1', '192.168.1(.)1', '192.168.1[dot]1', '192.168.1(dot)1', '192 .168 .1 .1', '192. 168. 1. 1']

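The regex has a few main parts, which I'll explain here:

([01]?[0-9]?[0-9]|2[0-4][0-9]|25[0-5])
This matches the numeric parts of the IP address. | means "or". The first case handles numbers from 0 to 199, with or without leading zeros. The other two cases handle numbers above 199.
[ (\[]?(\.|dot)[ )\]]?
This matches the "dot" parts. It has three sub-components:
- [ (\[]? The "prefix" for the dot: a space, an open parenthesis, or an open square bracket. The trailing ? makes this part optional.
- (\.|dot) Either a period or the word "dot".
- [ )\]]? The "suffix". Same logic as the prefix.
{3} means repeat the previous component 3 times.
The final element is another number, the same as the first, except that it is not followed by a dot.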
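The parts described above can also be assembled from editable lists, which is close to the "separators list" idea in the question. This is a sketch under assumptions: `PREFIXES`, `DOTS`, `SUFFIXES`, `alternation`, and `find_ips` are hypothetical names, not part of the answer. The three-digit alternatives are listed first here because, with the shorter alternative first, a final octet such as 222 would match only as 22.

```python
import re

# One octet, 0-255, leading zeros allowed. Three-digit cases are tried first
# so the final octet is not cut short.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# Hypothetical, editable lists of separator pieces.
PREFIXES = [" ", "(", "["]
DOTS = [".", "dot"]
SUFFIXES = [" ", ")", "]"]

def alternation(options):
    """Build a non-capturing group that matches any one of the literal options."""
    return "(?:" + "|".join(re.escape(option) for option in options) + ")"

# Optional prefix, mandatory dot, optional suffix.
SEPARATOR = alternation(PREFIXES) + "?" + alternation(DOTS) + alternation(SUFFIXES) + "?"

# Three "octet + separator" repetitions followed by a final octet.
IP_PATTERN = re.compile("(?:" + OCTET + SEPARATOR + "){3}" + OCTET)

def find_ips(text):
    """Return every (possibly censored) IP-like match as it appears in the text."""
    return [m.group(0) for m in IP_PATTERN.finditer(text)]

print(find_ips("a 192.168.1[dot]1 b 10.0.0.255 c 192 .168 .1 .1"))
```

Adding a new censorship style then means appending to one of the lists rather than editing the regex by hand.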

score 3 · Accepted Answer

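Description

This regex will match each of the four octets of something that looks like an IP address. Each octet is placed into its own capture group for collection.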

(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])\D{1,5}(2[0-4][0-9]|[01]?[0-9]?[0-9]|25[0-5])

Given the following sample text, this regex will match all 10 embedded IP strings in their entirety, including the first one. Working example: http://www.rubular.com/r/1MbGZOhuj5

The following are IP addresses: 192.168.1.222, 8.8.8.8, 101.099.098.000. These can also appear as 192.168.1[.]1 or 192.168.1(.)1 or 192.168.1[dot]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1. and these censorship methods could apply to any of the dots (Ex: 192[.]168[.]1[.]1).

The resulting matches can be iterated over, and a properly formatted IP string can be constructed by joining the 4 capture groups with a dot.
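The joining step might look like the sketch below. The names `FOUR_OCTETS` and `normalized_ips` are illustrative, and the octet alternatives are reordered (three-digit cases first) as an adjustment for this sketch, so that a final octet such as 255 is captured whole. Note that \D{1,5} accepts any non-digit separator of up to five characters, so this approach is deliberately loose.

```python
import re

# One capturing group per octet; three-digit alternatives are tried first.
OCTET = r"(25[0-5]|2[0-4][0-9]|[01]?[0-9]?[0-9])"

# Four octets separated by 1 to 5 non-digit characters.
FOUR_OCTETS = re.compile(
    OCTET + r"\D{1,5}" + OCTET + r"\D{1,5}" + OCTET + r"\D{1,5}" + OCTET
)

def normalized_ips(text):
    """Join the four captured octets of every match with plain dots."""
    return [".".join(m.groups()) for m in FOUR_OCTETS.finditer(text)]

print(normalized_ips("192.168.1[dot]20 or 10.10.10 .21 or 192[.]168[.]1[.]255"))
```

Because the octets are captured separately, no cleanup pass is needed to strip brackets or the word "dot".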

score 1 · Accepted Answer

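The code below will...

find the IPs in a string even when they are censored (e.g. 192.168.1[dot]20 or 10.10.10 .21)
place them into a list
clean them of the censorship (spaces/braces/parentheses)
and replace the uncleaned list entry with the cleaned one.

Caveat: The code below does not account for incorrect/invalid IPs such as 192.168.0.256 or 192.168.1.2.3. Currently, it will drop the trailing digit (the 6 and 3 from the aforementioned). If the first octet is invalid (e.g. 256.10.10.10), it will drop the leading digit (resulting in 56.10.10.10).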


import re

def extractIPs(fileContent):
    # Four octets (0-255, leading zeros allowed) joined by an optionally wrapped "." or "dot".
    pattern = r"((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)([ (\[]?(\.|dot)[ )\]]?(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3})"
    ips = []
    for each in re.findall(pattern, fileContent):
        # each[0] is the whole match; strip spaces/brackets, then turn "dot" into ".".
        ip = re.sub(r"[ ()\[\]]", "", each[0])
        ips.append(ip.replace("dot", "."))
    return ips


# The with-block closes the file once it has been read.
with open('***INSERT FILE PATH HERE***') as myFile:
    fileContent = myFile.read()

IPs = extractIPs(fileContent)
print "Original file content:\n{0}".format(fileContent)
print "--------------------------------"
print "Parsed results:\n{0}".format(IPs)

score 0 · Accepted Answer

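Extracting and Classifying IPv4 Addresses (Even When Censored)

Note: This is just an implementation of a class I wrote to extract IPv4 addresses. I will likely update the class with a method for this functionality in the future. You can find it on my GitHub page.

What I show below is the following:

Cleaning your example string content
Putting your string data into a list
Using the ExtractIPs() class to parse and classify IPv4 addresses
- This class returns a dictionary containing 4 lists:
  - Valid IPv4 addresses
  - Public IPv4 addresses
  - Private IPv4 addresses
  - Invalid IPv4 addresses

ExtractIPs Class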

#!/usr/bin/env python

"""Extract and Classify IP Addresses."""

import re  # Use Regular Expressions.


__program__ = "IPAddresses.py"
__author__ = "Johnny C. Wachter"
__copyright__ = "Copyright (C) 2014 Johnny C. Wachter"
__license__ = "MIT"
__version__ = "0.0.1"
__maintainer__ = "Johnny C. Wachter"
__contact__ = "wachter.johnny@gmail.com"
__status__ = "Development"


class ExtractIPs(object):

    """Extract and Classify IP Addresses From Input Data."""

    def __init__(self, input_data):
        """Instantiate the Class."""

        self.input_data = input_data

        self.ipv4_results = {
            'valid_ips': [],  # Store all valid IP Addresses.
            'invalid_ips': [],  # Store all invalid IP Addresses.
            'private_ips': [],  # Store all Private IP Addresses.
            'public_ips': []  # Store all Public IP Addresses.
        }

    def extract_ipv4_like(self):
        """Extract IP-like strings from input data.
        :rtype : list
        """

        ipv4_like_list = []

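        # The trailing $ rejects entries such as '1.2.3.4abc', which would
        # otherwise pass this check and fail later in int().
        ip_like_pattern = re.compile(r'([0-9]{1,3}\.){3}([0-9]{1,3})$')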

        for entry in self.input_data:

            if re.match(ip_like_pattern, entry):

                if len(entry.split('.')) == 4:

                    ipv4_like_list.append(entry)

        return ipv4_like_list

    def validate_ipv4_like(self):
        """Validate that IP-like entries fall within the appropriate range."""

        if self.extract_ipv4_like():

            # We're gonna want to ignore the below two addresses.
            ignore_list = ['0.0.0.0', '255.255.255.255']

            # Separate the Valid from Invalid IP Addresses.
            for ipv4_like in self.extract_ipv4_like():

                # Split the 'IP' into parts so each part can be validated.
                parts = ipv4_like.split('.')

                # All part values should be between 0 and 255.
                if all(0 <= int(part) < 256 for part in parts):

                    if ipv4_like not in ignore_list:

                        self.ipv4_results['valid_ips'].append(ipv4_like)

                else:

                    self.ipv4_results['invalid_ips'].append(ipv4_like)

        else:
            pass

    def classify_ipv4_addresses(self):
        """Classify Valid IP Addresses."""

        if self.ipv4_results['valid_ips']:

            # Now we will classify the Valid IP Addresses.
            for valid_ip in self.ipv4_results['valid_ips']:

                private_ip_pattern = re.findall(

                    r"""

                    (^127\.0\.0\.1)|  # Loopback

                    (^10\.(\d{1,3}\.){2}\d{1,3})|  # 10/8 Range

                    # Matching the 172.16/12 Range takes several matches
                    (^172\.1[6-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.2[0-9]\.\d{1,3}\.\d{1,3})|
                    (^172\.3[0-1]\.\d{1,3}\.\d{1,3})|

                    (^192\.168\.\d{1,3}\.\d{1,3})|  # 192.168/16 Range

                    # Match APIPA Range.
                    (^169\.254\.\d{1,3}\.\d{1,3})

                    # VERBOSE for a clean look of this RegEx.
                    """, valid_ip, re.VERBOSE
                )

                if private_ip_pattern:

                    self.ipv4_results['private_ips'].append(valid_ip)

                else:
                    self.ipv4_results['public_ips'].append(valid_ip)

        else:
            pass

    def get_ipv4_results(self):
        """Extract and classify all valid and invalid IP-like strings.
        :returns : dict
        """

        self.extract_ipv4_like()
        self.validate_ipv4_like()
        self.classify_ipv4_addresses()

        return self.ipv4_results
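As an alternative to the regex in classify_ipv4_addresses, the same ranges can be checked numerically. This is a sketch, not part of the class above; `is_private_ipv4` is a hypothetical helper, and it assumes the input has already been validated as four octets in the 0-255 range. Unlike the class, which lists only 127.0.0.1, it treats the whole 127/8 block as loopback.

```python
def is_private_ipv4(ip):
    """Return True for private-use, loopback, or APIPA dotted-quad strings."""
    # int() accepts leading zeros, so '099' is read as 99.
    first, second = [int(part) for part in ip.split('.')[:2]]
    return (
        first == 10                                  # 10.0.0.0/8
        or first == 127                              # loopback, 127.0.0.0/8
        or (first == 172 and 16 <= second <= 31)     # 172.16.0.0/12
        or (first == 192 and second == 168)          # 192.168.0.0/16
        or (first == 169 and second == 254)          # 169.254.0.0/16 (APIPA)
    )

for address in ['192.168.1.1', '172.31.0.1', '172.32.0.1', '8.8.8.8']:
    print("%s private: %s" % (address, is_private_ipv4(address)))
```

Comparing integers avoids spelling the 172.16/12 range out as three separate patterns.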

Censored Extraction Example

censored = re.compile(
    r"""

    \(\.\)|
    \(dot\)|
    \[\.\]|
    \[dot\]|
    [ ]\.  # Space, then dot. A bare space is ignored under re.VERBOSE.

    """, re.VERBOSE | re.IGNORECASE
)

data_list = input_string.split()  # Bring your input string to a list.

clean_list = []  # List to store the cleaned up input.

for entry in data_list:

    # Remove undesired leading and trailing characters.
    clean_entry = entry.strip(' .,<>?/[]\\{}"\'|`~!@#$%^&*()_+-=')

    clean_list.append(clean_entry)  # Add the entry to the clean list.

clean_unique_list = list(set(clean_list))  # Remove duplicates in list.

# Now we can go ahead and extract IPv4 Addresses. Note that this will be a dict.
results = ExtractIPs(clean_list).get_ipv4_results()

for k, v in results.iteritems():

    # After all that work, make sure the results are nicely presented!
    print("\n%s: %s" % (k, v))

Results:

public_ips: ['8.8.8.8', '101.099.098.000']

valid_ips: ['192.168.1.1', '8.8.8.8', '101.099.098.000']

invalid_ips: []

private_ips: ['192.168.1.1']
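In the example code, the censored pattern is compiled but never applied, which is why only the uncensored addresses appear in the results. One way to use such a pattern is to normalize the text before splitting it, as in this sketch. The names `CENSORED_DOT` and `uncensor` are illustrative, and the pattern is a variation written for this sketch rather than the author's. It can still merge a sentence-ending period with a number that follows it.

```python
import re

CENSORED_DOT = re.compile(
    r"""
    \(\.\) | \(dot\) |        # (.) and (dot)
    \[\.\] | \[dot\] |        # [.] and [dot]
    (?<=\d)[ ]\.(?=\d) |      # space before the dot, between digits
    (?<=\d)\.[ ](?=\d)        # space after the dot, between digits
    """,
    re.VERBOSE | re.IGNORECASE,
)

def uncensor(text):
    """Replace every censored dot with a plain '.'."""
    return CENSORED_DOT.sub(".", text)

sample = "192.168.1[.]1 or 192.168.1(dot)1 or 192 .168 .1 .1 or 192. 168. 1. 1."
print(uncensor(sample))
```

The normalized string could then be split and cleaned as in the example, for instance with data_list = uncensor(input_string).split(). The results would then contain more entries than the ones listed above.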
