python - Python: How do I make a for loop append data to a list when the format is non-standard?

Question

I'm looking to read some marker data into data structures using Python. So far, I have successfully read every Marker name into a single list (there are 2,000 of those).

The data I have was originally in Excel, but I converted it into a .txt file.

The header data in the file was removed and assigned to variables using readline().

Every line with a marker name begins with a double quotation mark (") so I was able to easily gain that information and store it as a list.

Each line with the data for that marker is indented 2 spaces and there are lines that begin with either "a" , "b" , or "h". I want to get these into a data structure. I've tried both lists and strings, but both are returned as empty. The data under each marker name is a block with the three letters "a", "b", and "h" with each letter representing an individual in a population (there are 250). The tricky thing is that there are 5 letters separated by a single space, but then those 5-letter blocks are separated from other 5-letter blocks by two spaces.

Example:

"BK_12 (a,h,b) ; 1"
  b a a a b  a b a a a  b a b a a  a a a a a  a a a b b  a a b a h  b   
  a a a a a  a a a a a  a a a a a  a b a a a  a h a a a  a a a a a  h
  a a b a a  a h a a a  a h a h a  a a a a a  a a b a a  a a a a h  a
  a a a b a  a a a a a  a a b a a  b b a b a  h a b a a  a b a a a  h 
  a a a a

That part I don't really need help with, but just included for reference of how the file looks. My ultimate goal is to use phenotype data to find markers associated with a specific phenotype.

I have used a for loop to do this so far. My code is below. EDIT: I tried indexing from position 2, rather than searching from position 0 for an empty space, and I thought this would work. The else: branch was meant to tell me whether the elif branches were being recognized. Nothing was printed, so I assume that part is working, but nothing is being appended.

Markers = []
Genotype_Data = []

for line in infile:
    line=line.rstrip()
    if (line[0] == '"'):
        line=line.rstrip()
        Markers.append(line)
    elif (line[2] == 'a'):
        line=line.rstrip()
        Genotype_Data.append(line)
    elif (line[2] == 'b'):
        line=line.rstrip()
        Genotype_Data.append(line)
    elif (line[2] == 'h'):
        line=line.rstrip()
        Genotype_Data.append(line)
    else:
        print("Something isn't right!")

score 0 · Accepted Answer

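It's still not clear to me what format you want the data to end up in within the Genotype_Data list, but you should be able to adapt the following as needed: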

Markers = []
Genotype_Data = []
INDIVIDUALS = set('abh')

with open('genotype_data.txt', mode='rt') as infile:
    line = next(infile).rstrip()  # read first line of file (next() works in Python 2.6+ and 3)
    if line[0] == '"':
        Markers.append(line)
    else:
        raise ValueError('marker line expected')

    geno_accumulator = []
    for line in infile:  # read remainder of file
        line = line.rstrip()
        if not line:  # skip blank lines; line[0] would raise IndexError on them
            continue
        if line[0] == '"':
            Genotype_Data.append(geno_accumulator)
            geno_accumulator = []
            Markers.append(line)
        elif line[2] in INDIVIDUALS:
            geno_accumulator.append(line)
        else:
            raise ValueError('unrecognized line of input data encountered')

    if geno_accumulator:  # append the final bit of genotype data
        Genotype_Data.append(geno_accumulator)

print('Markers:', Markers)
print('Genotype_Data:', Genotype_Data)
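
If one list element per individual is what you are after, one possible adaptation is to flatten each block of raw lines with split() and key the result by marker name. This is only a sketch: the helper name flatten_block and the shortened sample data are made up for illustration, and it assumes every whitespace-separated token in a block is a single call.

```python
def flatten_block(block_lines):
    """Turn the raw lines of one marker block into a flat list of calls."""
    calls = []
    for raw in block_lines:
        # split() with no argument treats any run of spaces as one separator,
        # so the single and double spacing in the file both disappear.
        calls.extend(raw.split())
    return calls


# Hypothetical sample in the shape the code above produces:
# one marker name, and one list of raw lines per marker.
markers = ['"BK_12 (a,h,b) ; 1"']
genotype_data = [['  b a a a b  a b a a a  b', '  a a a a']]

# Pair each marker with its flattened block of calls.
calls_by_marker = {
    marker: flatten_block(block)
    for marker, block in zip(markers, genotype_data)
}
print(calls_by_marker)
```

With the data in this shape, the call for individual i at a given marker is calls_by_marker[name][i], which may make it easier to line up against phenotype data.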

score 0 · Accepted Answer

I don't understand what your goal is.

Maybe this can help you get there:

>>> print(line.split()) # just the letters; runs of spaces are ignored
['b', 'a', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'b', 'a', 'h', 'b']
>>> print(line.split(' ')) # the letters, plus '' wherever extra spaces occur (e.g. where a new block starts)
['', '', 'b', 'a', 'a', 'a', 'b', '', 'a', 'b', 'a', 'a', 'a', '', 'b', 'a', 'b', 'a', 'a', '', 'a', 'a', 'a', 'a', 'a', '', 'a', 'a', 'a', 'b', 'b', '', 'a', 'a', 'b', 'a', 'h', '', 'b', '', '', '']
>>> '  x x  '.strip()
'x x'
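
If the five-letter grouping itself matters, a sketch along these lines might work: strip the line, split it wherever two or more spaces occur, then split each group on whitespace. The sample line is shortened from the question and the variable name groups is just an illustration.

```python
import re

# Shortened sample line from the question: two leading spaces,
# groups separated by two spaces, and some trailing spaces.
line = '  b a a a b  a b a a a  b   '

# Split on runs of two or more spaces to get the groups,
# then split each group on whitespace to get the single letters.
groups = [group.split() for group in re.split(r' {2,}', line.strip())]
print(groups)
```

Each inner list is one block of up to five individuals; the last block on a line can be shorter.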
