
I have a source file of about 10,000 lines with tons of duplication, so I read the file in as text.

Example:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data

I want to replace all occurrences of this pattern with

    real0_data = _get_data(real0, "real0")

where real0 can be any variable name [a-z0-9]+


So don't get confused by the source code. The code itself doesn't matter; this is purely text processing with regex.

This is what I have so far:

    PATH = "func.pyx"
    source_string = open(PATH,"r").read()

    pattern = r"""
    assert PyArray_TYPE\(([a-z0-9]+)\) == np.NPY_DOUBLE, "([a-z0-9]+) is not double"
    assert ([a-z0-9]+).ndim == 1, "([a-z0-9]+) has wrong dimensions"
    if not (PyArray_FLAGS(([a-z0-9]+)) & np.NPY_C_CONTIGUOUS):
       ([a-z0-9]+) = PyArray_GETCONTIGUOUS(([a-z0-9]+))
    ([a-z0-9]+)_data = ([a-z0-9]+).data"""

    


1 Answer


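You can do this in any text editor that supports multiline regular expression search and replace.

I used Komodo IDE to test this, because it includes an excellent regular expression tester ("Rx Toolkit") for experimenting with regular expressions. I expect there are online tools like it as well. The same regular expression works in the free Komodo Edit, and it should also work in most other editors that support Perl-compatible regular expressions.

In Komodo, I used the Replace dialog with the Regex option checked to find: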

    assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*\1_data = <double\*>\1\.data

and replace it with:

    \1_data = _get_data(\1, "\1")

Given this test code:

    assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
    assert real0.ndim == 1, "real0 has wrong dimensions"
    if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
        real0 = PyArray_GETCONTIGUOUS(real0)
    real0_data = <double*>real0.data

    assert PyArray_TYPE(real1) == np.NPY_DOUBLE, "real1 is not double"
    assert real1.ndim == 1, "real1 has wrong dimensions"
    if not (PyArray_FLAGS(real1) & np.NPY_C_CONTIGUOUS):
        real1 = PyArray_GETCONTIGUOUS(real1)
    real1_data = <double*>real1.data

    assert PyArray_TYPE(real2) == np.NPY_DOUBLE, "real2 is not double"
    assert real2.ndim == 1, "real2 has wrong dimensions"
    if not (PyArray_FLAGS(real2) & np.NPY_C_CONTIGUOUS):
        real2 = PyArray_GETCONTIGUOUS(real2)
    real2_data = <double*>real2.data

the result is:

    real0_data = _get_data(real0, "real0")

    real1_data = _get_data(real1, "real1")

    real2_data = _get_data(real2, "real2")

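So how did I get that regular expression from your original code?

  1. Prefix every instance of `(`, `)`, `.`, and `*` with `\` to escape them (a simple manual search and replace).
  2. Replace the first instance of `real0` with `(\w+)`. This matches and captures a run of word characters (letters, digits, and underscores).
  3. Replace the remaining instances of `real0` with `\1`. This matches the text captured by `(\w+)`.
  4. Replace each newline, together with the leading whitespace on the next line, with `\s*\n\s*`. This matches any trailing whitespace on the line, plus the newline, plus all leading whitespace on the next line. That way the regular expression works regardless of the nesting level of the code being matched.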
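The four steps above can also be carried out mechanically. Here is a minimal sketch in Python, assuming the sample block from the question serves as the template. The helper name `build_pattern` is hypothetical, and `re.escape` escapes more characters than step 1 strictly requires (spaces, for example), which is harmless.

```python
import re

# One concrete instance of the duplicated block, taken from the question.
TEMPLATE = '''\
assert PyArray_TYPE(real0) == np.NPY_DOUBLE, "real0 is not double"
assert real0.ndim == 1, "real0 has wrong dimensions"
if not (PyArray_FLAGS(real0) & np.NPY_C_CONTIGUOUS):
    real0 = PyArray_GETCONTIGUOUS(real0)
real0_data = <double*>real0.data'''


def build_pattern(template, name):
    """Turn one concrete block into a regex (hypothetical helper)."""
    # Step 1: escape regex metacharacters on each line, ignoring indentation.
    parts = [re.escape(line.strip()) for line in template.splitlines()]
    # Step 4: join the lines with whitespace-tolerant newline matching.
    pattern = r"\s*\n\s*".join(parts)
    escaped_name = re.escape(name)
    # Step 2: the first occurrence of the name becomes a capture group.
    pattern = pattern.replace(escaped_name, r"(\w+)", 1)
    # Step 3: every remaining occurrence becomes a back-reference.
    return pattern.replace(escaped_name, r"\1")


pattern = build_pattern(TEMPLATE, "real0")
print(pattern)
```

This only works cleanly because `real0` does not occur inside any other identifier in the template; a name such as `np` would need more careful handling.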

Finally, the "replace" text uses `\1` wherever it needs the originally captured text.
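For readers doing the replacement in Python rather than an editor, a small sketch of how a captured group is reused in the replacement text. The sample string is made up for illustration. Python's `re` module accepts both `\1` and the more explicit `\g<1>`; the latter avoids ambiguity if a digit ever follows the reference.

```python
import re

text = "real0.data real1.data"  # made-up sample input

# Both forms insert whatever group 1 captured into the replacement.
short_form = re.sub(r"(\w+)\.data", r"\1_data", text)
explicit_form = re.sub(r"(\w+)\.data", r"\g<1>_data", text)

print(short_form)
print(explicit_form)
```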

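You can of course use a similar regular expression in Python if you prefer to do it that way. I'd suggest `\w` instead of `[a-z0-9]`, just to keep it simpler. Also, don't embed the newlines and leading whitespace in a multiline string; use `\s*\n\s*` between lines instead, as I did. That keeps the pattern independent of the nesting level, as mentioned above.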
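A minimal sketch of that Python version, assuming the find and replace expressions from the editor carry over unchanged to Python's `re` module (they use only syntax that `re` supports). The function name `rewrite` and the demo string are hypothetical.

```python
import re

# Adjacent raw strings are concatenated by Python into one expression;
# the split is purely for readability, one source line per string.
PATTERN = re.compile(
    r'assert PyArray_TYPE\((\w+)\) == np\.NPY_DOUBLE, "\1 is not double"\s*\n\s*'
    r'assert \1\.ndim == 1, "\1 has wrong dimensions"\s*\n\s*'
    r'if not \(PyArray_FLAGS\(\1\) & np\.NPY_C_CONTIGUOUS\):\s*\n\s*'
    r'\1 = PyArray_GETCONTIGUOUS\(\1\)\s*\n\s*'
    r'\1_data = <double\*>\1\.data'
)
REPLACEMENT = r'\1_data = _get_data(\1, "\1")'


def rewrite(source):
    """Return (new_text, number_of_blocks_replaced)."""
    return PATTERN.subn(REPLACEMENT, source)


# Demo on a block nested deeper than the one in the question.
demo = (
    '        assert PyArray_TYPE(x) == np.NPY_DOUBLE, "x is not double"\n'
    '        assert x.ndim == 1, "x has wrong dimensions"\n'
    '        if not (PyArray_FLAGS(x) & np.NPY_C_CONTIGUOUS):\n'
    '            x = PyArray_GETCONTIGUOUS(x)\n'
    '        x_data = <double*>x.data\n'
)
new_text, count = rewrite(demo)
print(count)
print(new_text)
```

To process the real file, read `func.pyx` into a string as in the question, pass it to `rewrite`, and compare the result against the original before writing it back. The returned count is a quick sanity check against the number of blocks you expect to be replaced.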

answered 2013-04-18T18:08:36.350