python - How to filter strings based on substrings in Python

Question

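I am a Python newbie coming from the Java world.

I am trying to write a simple Python function that prints only the data rows of a CSV or "arff" file. Non-data rows begin with one of these 3 patterns: @, [@, [%, and such rows should not be printed.

Sample data file snippet: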

% 1. Title: Iris Plants Database
% 
% 2. Sources:

%      (a) Creator: R.A. Fisher
%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%      (c) Date: July, 1988

@RELATION iris

@ATTRIBUTE sepallength  REAL
@ATTRIBUTE sepalwidth   REAL
@ATTRIBUTE petallength  REAL
@ATTRIBUTE petalwidth   REAL
@ATTRIBUTE class    {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa

Python script:

import csv
def loadCSVfile (path):
    csvData = open(path, 'rb') 
    spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
    for row in spamreader:
        if row.__len__ > 0:
            #search the string from index 0 to 2 and if these substrings(@ ,'[\'%' , '[\'@') are not found, than print the row
            if (str(row).find('@',0,1) & str(row).find('[\'%',0,2) & str(row).find('[\'@',0,2) != 1):
                print str(row)

loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')

Actual output:

['% 1. Title: Iris Plants Database']
['% ']
['% 2. Sources:']
['%      (a) Creator: R.A. Fisher']
['%      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)']
['%      (c) Date: July', ' 1988']
['% ']
[]
['@RELATION iris']
[]
['@ATTRIBUTE sepallength\tREAL']
['@ATTRIBUTE sepalwidth \tREAL']
['@ATTRIBUTE petallength \tREAL']
['@ATTRIBUTE petalwidth\tREAL']
['@ATTRIBUTE class \t{Iris-setosa', 'Iris-versicolor', 'Iris-virginica}']
[]
['@DATA']
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

Expected output:

['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']
['4.6', '3.4', '1.4', '0.3', 'Iris-setosa']
['5.0', '3.4', '1.5', '0.2', 'Iris-setosa']

score 2 · Accepted Answer

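To test whether a row is empty, just use it in a boolean context; an empty list is false.

To test whether a string starts with certain characters, use str.startswith(), which accepts either a single string or a tuple of strings: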
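A small illustration of that truthiness rule, written for Python 3. The sample rows are made up for the example; they mirror what csv.reader yields for comment, blank, and data lines.

```python
# Rows as csv.reader would yield them: a blank line becomes an empty list.
rows = [
    ['% 1. Title: Iris Plants Database'],
    [],
    ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
]

# An empty list is falsy, so `if row` drops it without any length check.
non_empty = [row for row in rows if row]
print(len(non_empty))
```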

import csv
def loadCSVfile (path):
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and not row[0].startswith(('%', '@')):
                print row
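The function above targets Python 2. As a rough sketch of how the same startswith() test might look on Python 3, the version below reads from any iterable of text lines; the name iter_data_rows and the in-memory sample are assumptions made for the example, not part of the original code.

```python
import csv
import io

def iter_data_rows(lines):
    """Yield parsed rows, skipping blank lines and lines starting with '%' or '@'."""
    for row in csv.reader(lines, delimiter=',', quotechar='|'):
        # str.startswith accepts a tuple and returns True if any prefix matches.
        if row and not row[0].startswith(('%', '@')):
            yield row

# Hypothetical in-memory stand-in for a small .arff file.
sample = io.StringIO(
    "% 1. Title: Iris Plants Database\n"
    "\n"
    "@RELATION iris\n"
    "@DATA\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
data_rows = list(iter_data_rows(sample))
for row in data_rows:
    print(row)
```

With a real file on Python 3, the csv module expects a text-mode file opened with newline='' rather than 'rb', for example open(path, newline='').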

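Because you are in effect testing fixed-width strings, you could also just slice the first column and use an in test against a sequence; a set would be the most efficient: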

def loadCSVfile (path):
    ignore = {'@', '%'}
    with open(path, 'rb') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        for row in spamreader:
            if row and row[0][:1] not in ignore:
                print row

Here the [:1] slice notation returns the first character of the row[0] column (or an empty string if the first column is empty).
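The difference between slicing and indexing an empty first column can be checked directly. This is a minimal Python 3 sketch; the variable names are chosen for the example.

```python
first_column = ''

# Slicing never raises: an empty string simply yields an empty string.
print(repr(first_column[:1]))

# Indexing the same empty string raises IndexError instead.
try:
    first_column[0]
    index_failed = False
except IndexError:
    index_failed = True

# Membership tests against the set of ignored first characters.
ignore = {'@', '%'}
print('@DATA'[:1] in ignore, '5.1'[:1] in ignore, first_column[:1] in ignore)
```

Note that a row whose first column is empty is therefore kept, since the empty string is not in the ignore set.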

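I used the open file object as a context manager (with ... as ...) so that Python automatically closes the file for us when the code block completes (or raises an exception).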
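To see the closing behaviour in isolation, the Python 3 sketch below raises an exception inside a with block and then inspects the file object. The temporary file and the simulated ValueError are assumptions made purely for the demonstration.

```python
import os
import tempfile

# Create a throwaway file so the example is self-contained.
fd, path = tempfile.mkstemp(suffix='.arff')
os.close(fd)

closed_after_error = None
try:
    with open(path) as handle:
        raise ValueError('simulated failure inside the block')
except ValueError:
    # The with statement has already closed the file by this point.
    closed_after_error = handle.closed
finally:
    os.remove(path)

print(closed_after_error)
```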

You should never call double-underscore methods ("dunder" methods or special methods) directly; the proper API call here is len(row).

Demo:
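There is a second problem with row.__len__ > 0 in the question: without parentheses the method is never called, so the comparison is between a method object and a number. A short Python 3 sketch of the difference follows; Python 2, which the question uses, permitted such mixed-type comparisons, which would be consistent with the empty rows still being printed.

```python
row = []

# Without parentheses, row.__len__ is a bound method object, not a number.
print(callable(row.__len__))

# Python 3 refuses to order a method against an int.
try:
    row.__len__ > 0
    comparison_raised = False
except TypeError:
    comparison_raised = True

# The supported spellings: len(row), or simply the truthiness of the row.
print(len(row) > 0, bool(row))
```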

>>> loadCSVfile('/tmp/iris.arff')
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
['4.9', '3.0', '1.4', '0.2', 'Iris-setosa']
['4.7', '3.2', '1.3', '0.2', 'Iris-setosa']
['4.6', '3.1', '1.5', '0.2', 'Iris-setosa']
['5.0', '3.6', '1.4', '0.2', 'Iris-setosa']
['5.4', '3.9', '1.7', '0.4', 'Iris-setosa']

score 0 · Answer

I would make use of the in operator and a Python list comprehension.

Here is what I mean:

import csv

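def loadCSVfile (path):
    exclusions = ['@', '%', '\n', '[@' , '[%']
    # Open in a with block so the file is closed once the rows are read
    with open(path, 'r') as csvData:
        spamreader = csv.reader(csvData, delimiter=',', quotechar='|')
        # Keep non-empty rows whose first cell has no excluded 1- or 2-character prefix
        lines = [line for line in spamreader if ( line and line[0][0:1] not in exclusions and line[0][0:2] not in exclusions )]

    for line in lines:
        print(line)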


loadCSVfile('C:/Users/anaim/Desktop/Data Mining/OneR/iris.arff')
