Here is a sample of my CSV file (test.csv). Note: the full test.csv is about 60 MB.

"Position","Value"
"2545600","19"
"2545601","19"
"2545602","19"
"2545603","19"
"2545604","20"
"2545605","20"
"2545606","21"
"2545607","22"
"2545608","21"
"2545609","20"
"2545610","21"
"2545611","18"
"2545612","19"
"2545613","21"
"2545614","21"
"2545615","21"
"2545616","21"
"2545617","22"
"2545618","25"
"2545619","25"

My Python code (test.py) is as follows:

#!/usr/bin/python
import sys

txt = open(sys.argv[1], 'r')
out = open(sys.argv[2], 'w')
mil = float(sys.argv[3])

out.write('chr\tstart\tend\tfeature\t'+sys.argv[2]+'\n')

for line in txt:
    if 'Position' not in line:
        line = line.strip('",\n')
        line = line.split('","')

        line[1] = str(int(line[1])/mil)

        out.write('gi|255767013|ref|NC_000964.3|\t'+line[0]+'\t'+line[0]+'\t\t'+line[1]+'\n')

txt.close()
out.close()

My command line:

python test.py test.csv test.igv 5

Running the command produces this error:

Traceback (most recent call last):
  File "test.py", line 15, in <module>
    line[1] = str(int(line[1])/mil)
ValueError: invalid literal for int() with base 10: '3"\r'

However, if I create a new, empty CSV file (small.csv) and copy/paste just a few lines from test.csv into it (like the sample above), the command runs successfully.

python test.py small.csv small.igv 5

Input small.csv:

"Position","Value"
"2545600","19"
"2545601","19"
"2545602","19"
"2545603","19"
"2545604","20"
"2545605","20"
"2545606","21"
"2545607","22"
"2545608","21"
"2545609","20"

Output small.igv:

chr start   end feature small.igv
gi|255767013|ref|NC_000964.3|   2545600 2545600     3.8
gi|255767013|ref|NC_000964.3|   2545601 2545601     3.8
gi|255767013|ref|NC_000964.3|   2545602 2545602     3.8
gi|255767013|ref|NC_000964.3|   2545603 2545603     3.8
gi|255767013|ref|NC_000964.3|   2545604 2545604     4.0
gi|255767013|ref|NC_000964.3|   2545605 2545605     4.0
gi|255767013|ref|NC_000964.3|   2545606 2545606     4.2
gi|255767013|ref|NC_000964.3|   2545607 2545607     4.4
gi|255767013|ref|NC_000964.3|   2545608 2545608     4.2
gi|255767013|ref|NC_000964.3|   2545609 2545609     4.0

This is exactly what I want. So the question is: why does it fail on the larger CSV file?

3 Answers

Try

for line in ..... :
     line = line.strip()

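Called without arguments, strip() removes all leading and trailing whitespace, so it drops the line ending from each line, including the trailing '\r' that shows up in your error message.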
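To see why this helps, here is a small sketch comparing the question's strip('",\n') with a plain strip() applied first. The sample string is made up for illustration, and it assumes the lines of the large file end in '\r\n', as the '\r' in the error message suggests.

```python
# Hypothetical sample line with a Windows-style '\r\n' ending.
raw = '"2545600","19"\r\n'

# Question's approach: '\r' is not in the strip set, so stripping stops there
# and the closing quote plus '\r' stay attached to the last field.
broken = raw.strip('",\n').split('","')

# Suggested approach: strip() with no arguments removes all surrounding
# whitespace (including '\r' and '\n') before the quotes are stripped.
fixed = raw.strip().strip('",').split('","')

print(broken)  # ['2545600', '19"\r']
print(fixed)   # ['2545600', '19']
```

The small copy/pasted file presumably ended up with plain '\n' line endings, which would explain why the original code worked on it.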

A better solution: use Python's csv module, which handles these details for you.

Answered 2013-01-21T19:22:37.797

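In this case it is much better to use the csv module. Each row read from the CSV file is returned as a list of strings, so there is no whitespace or quote stripping to get wrong. You can also specify the delimiter as an argument to csv.reader (not strictly needed here, since the comma is the default).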

import csv
import sys

mil = float(sys.argv[3])

# Python 3: newline='' lets the csv module handle '\n' and '\r\n' itself
with open(sys.argv[1], newline='') as f, open(sys.argv[2], 'w') as out:
    out.write('chr\tstart\tend\tfeature\t' + sys.argv[2] + '\n')
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)    # Consider headers separately
    for line in reader:
        line[1] = str(int(line[1]) / mil)
        out.write('gi|255767013|ref|NC_000964.3|\t' + line[0] + '\t' + line[0] + '\t\t' + line[1] + '\n')

Running python test.py test.csv test.igv 5 && cat test.igv should display the expected output.

Answered 2013-01-21T19:48:25.453

As already suggested, the csv module is the better tool here.

For example:

import csv

# Python 3: open with newline="" so the csv module handles line endings
with open("ex.csv", newline="") as f:
    for line in csv.reader(f):
        print(line)

with the data

"Position","Value"
"2545600","19"
"2545601","19"
"2545602","19"
"2545603","19"

gives the output

['Position', 'Value']
['2545600', '19']
['2545601', '19']
['2545602', '19']
['2545603', '19']

which is much easier to work with.

The csv module can also write CSV files.
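As a minimal sketch of that last point, the following reads quoted CSV text with csv.reader and writes tab-separated rows with csv.writer. It uses in-memory io.StringIO buffers and made-up sample rows instead of the question's files. The divisor 5.0 and the 'small.igv' column name are assumptions borrowed from the question's command line, and the column layout copies the question's output.

```python
import csv
import io

# Made-up sample input with '\r\n' line endings, standing in for test.csv.
source = io.StringIO(
    '"Position","Value"\r\n"2545600","19"\r\n"2545604","20"\r\n', newline=''
)
target = io.StringIO(newline='')
mil = 5.0  # assumed divisor, as passed on the question's command line

reader = csv.reader(source)
# A tab delimiter gives the tab-separated layout the question writes by hand.
writer = csv.writer(target, delimiter='\t', lineterminator='\n')

next(reader)  # skip the header row
writer.writerow(['chr', 'start', 'end', 'feature', 'small.igv'])
for position, value in reader:
    # The empty string leaves the 'feature' column blank, as in the question.
    writer.writerow(['gi|255767013|ref|NC_000964.3|', position, position, '', int(value) / mil])

result = target.getvalue()
print(result)
```

With real files, the same reader and writer objects can be created from files opened with newline='' in place of the StringIO buffers.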

Answered 2013-01-21T19:31:35.330