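I'm new to Python in general. I've read a lot of posts, but I'm still not sure how to ignore the lines that start with '#'.

I need to:

  1. Make the four columns (col2-col5) in this tsv file into separate lists. (How would I opt to ignore the row with Hawaii, since its data is incomplete, and thus use 49 data points?)

  2. Then define a function Pearson(X,Y) that takes two lists as arguments and returns the Pearson correlation coefficient. Let X = [x1, x2, ..., xn] and Y = [y1, y2, ..., yn]. The Pearson correlation coefficient between X and Y is given by: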

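$$ r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}} $$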

listT = [26, 24, 23, 14, 15, 19, 21, 22, 18, 17, 16, 23, 24, 21, 20]

listH = [75, 69, 77, 51, 48, 68, 83, 68, 71, 51, 54, 71, 77, 67, 68]

def Pearson(X,Y):
    # Do something
    return

# Should print:  var T and var H: 0.8139
print("var T and var H: %.4f"%(Pearson(listT, listH)))

How would I write the Σ symbol when defining the function?
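Python has no Σ operator; the built-in `sum()` over a generator expression plays that role. The following is a minimal sketch of the formula above, assuming both lists have the same length and neither is constant (otherwise the denominator is zero). The intermediate names such as `sum_xy` are chosen here for illustration.

```python
import math

def Pearson(X, Y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(X)
    sum_x = sum(X)                                # Σ x_i
    sum_y = sum(Y)                                # Σ y_i
    sum_xy = sum(x * y for x, y in zip(X, Y))     # Σ x_i * y_i
    sum_x2 = sum(x * x for x in X)                # Σ x_i^2
    sum_y2 = sum(y * y for y in Y)                # Σ y_i^2
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

listT = [26, 24, 23, 14, 15, 19, 21, 22, 18, 17, 16, 23, 24, 21, 20]
listH = [75, 69, 77, 51, 48, 68, 83, 68, 71, 51, 54, 71, 77, 67, 68]

# The question states this should print 0.8139
print("var T and var H: %.4f" % Pearson(listT, listH))
```

Each `sum(...)` call corresponds to one Σ term, so the function body can be checked against the formula term by term.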

#------------------------------------------------------------------------
# Data from the CDC -- http://www.cdc.gov/ -- reports on prevalence of
# smoking, incidence of lung cancer, and deaths attributed to smoking.
# 
# Col1: state
# Col2: cases of lung cancer (per 100,000 inhabitants)
# Col3: smoking among adults (%)
# Col4: attempts to quit (%)
# Col5: smoking-related deaths (per 100,000 inhabitants)
#------------------------------------------------------------------------
Alabama 107.1   24.9    47.1    321.1
Alaska  89  24.9    54.2    296.2
Arizona 63.4    18.6    49.4    248.9
Arkansas    105 25.7    45.6    334.1
California  64.4    14.8    51.4    261
Colorado    56.9    20.1    42.4    252.7
Connecticut 81.1    18.1    49  253.8
Delaware    98.8    24.5    48.7    296
District of Columbia    80.2    21  54.2    257.3
Florida 85.5    20.4    44.2    275.5
Georgia 98.3    20.1    54.8    312.3
Hawaii  68.2    NA  NA  185.1
Idaho   62.7    17.5    47.2    254.1
Illinois    92  22.2    49.3    278.4
Indiana 102.8   25  47.5    322.2
Iowa    91.8    20.8    42.9    256.7
Kansas  84.8    19.8    43.7    270.8
Kentucky    132.6   27.6    47.6    378.1
Louisiana   108 23.6    51.8    309.1
Maine   99.3    21  55.3    303.8
Maryland    80.1    19.7    51.1    279.5
Massachusetts   83.3    18.5    52.5    258.6
Michigan    90  23.4    55.6    296.3
Minnesota   65  20.7    43.6    225.3
Mississippi 115.4   24.6    48.9    343.2
Missouri    103.9   24.1    43  325
Montana 73.1    20.4    45.4    292.6
Nebraska    82.8    20.3    46.7    251.9
Nevada  82.7    23.2    41.4    370.4
New Hampshire   80.6    21.8    53.2    294.8
New Jersey  78.8    18.9    49.6    253.1
New Mexico  57.6    20.3    45.6    250.8
New York    76.7    20  51.5    259.6
North Carolina  104.1   23.2    49.2    307
North Dakota    71.5    19.9    43.9    233
Ohio    97.4    25.9    41.3    310.6
Oklahoma    102.6   26.1    45.1    321.7
Oregon  77.6    20  46  277.5
Pennsylvania    89.4    22.7    47.1    269.1
Rhode Island    84.6    21.3    53.1    283
South Carolina  99.4    24.5    49.1    303.3
South Dakota    78.8    20.3    46.4    253.8
Tennessee   111.1   26.1    46.6    333.6
Texas   83.2    20.6    46.4    287.4
Utah    33.1    10.5    53.7    144.9
Vermont 90.2    20  55.4    272.2
Virginia    86.7    20.9    44.8    288.7
Washington  76.2    19.2    51.6    279.1
West Virginia   120 26.9    46.2    361.6
Wisconsin   75  22  47.7    258.2
Wyoming 57.8    21.7    48.5    294.2

Here's what I have so far:

import csv
import operator
import math
import sys

cases_lung_cancer = []  # blank list for 1st column

smoking_adults = []  # blank list for 2nd column

attempts_quit = []  # blank list for 3rd column

smoking_deaths = []  # blank list for 4th column

def Pearson(X, Y):
    pass  # function body not written yet

with open('cdc_data.tsv', newline='') as csv_f:

    for row in csv.DictReader(csv_f, delimiter='\t'):
        pass  # loop body not written yet
3 Answers

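import pandas as pd

# pandas has no read_tsv; read_csv with sep='\t' parses the file,
# and comment='#' skips the header block
data = pd.read_csv('cdc_data.tsv', sep='\t', comment='#', header=None)

# drop the incomplete Hawaii row and the state-name column before correlating
correlation = data.dropna().iloc[:, 1:].corr(method='pearson')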
answered 2014-12-31T23:10:51.540

Make the four columns (col2-col5) in this tsv file into separate lists, ignoring the row with Hawaii because its data is incomplete, and thus using 49 data points.

col0 = []
col1 = []
col2 = []
col3 = []
col4 = []

with open('cdc_data.tsv', 'r') as f:   # the file is closed automatically
    for line in f:
        if line.startswith('#'):   # filter out comments
            continue

        split_line = line.rstrip('\n').split('\t')   # split line into separate fields on TAB

        if len(split_line) < 5:   # drop any line that has fewer than 5 columns
            continue

        # assign each column to a separate list
        col0.append(split_line[0])
        col1.append(split_line[1])
        col2.append(split_line[2])
        col3.append(split_line[3])
        col4.append(split_line[4])

I'll leave the Hawaii issue and your question #2 as an exercise for you to complete.
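One possible way to handle the Hawaii row, offered as a sketch and not as this answer's own method: skip any row that has 'NA' in a numeric field, and convert the remaining fields to float at the same time. The function name `parse_rows` and the `sample` lines are made up for illustration; the sample rows are copied from the question's data and assume TAB-separated fields.

```python
def parse_rows(lines):
    """Return (states, columns), where columns holds four lists of floats."""
    states = []
    columns = [[], [], [], []]
    for line in lines:
        if line.startswith('#'):            # skip comment lines
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 5:                # skip blank or malformed lines
            continue
        if 'NA' in fields[1:]:              # skip rows with missing data (Hawaii)
            continue
        states.append(fields[0])
        for column, value in zip(columns, fields[1:]):
            column.append(float(value))
    return states, columns

sample = [
    "# Col1: state\n",
    "Alabama\t107.1\t24.9\t47.1\t321.1\n",
    "Hawaii\t68.2\tNA\tNA\t185.1\n",
    "Idaho\t62.7\t17.5\t47.2\t254.1\n",
]
states, columns = parse_rows(sample)
print(states)        # ['Alabama', 'Idaho']
print(columns[1])    # [24.9, 17.5]
```

Because a file object iterates over its lines, the same function should work on the real file by passing the open handle, for example `with open('cdc_data.tsv') as f: states, columns = parse_rows(f)`. Testing for 'NA' instead of the state name means any other incomplete row would be dropped as well.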

answered 2014-12-31T23:45:28.413
import math

col0 = []
col1 = []
col2 = []
col3 = []
col4 = []

def Pearson(X, Y):
    n = len(X)   # number of data points (49 once Hawaii is dropped)
    sum_xy = 0   # running sum of x*y
    sum_x = 0    # running sum of x
    sum_y = 0    # running sum of y
    sum_x2 = 0   # running sum of x squared
    sum_y2 = 0   # running sum of y squared
    for x, y in zip(X, Y):
        sum_xy += x * y
        sum_x += x
        sum_y += y
        sum_x2 += x ** 2
        sum_y2 += y ** 2

    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))


with open('cdc_data.tsv', 'r') as f:
    for line in f:
        if line.startswith('#'):        # filter out comments
            continue
        if line.startswith('Hawaii'):   # filter out Hawaii (incomplete data)
            continue

        split_line = line.rstrip('\n').split('\t')   # split line into separate fields on TAB

        if len(split_line) < 5:   # drop any line that has fewer than 5 columns
            continue

        # assign each column to a separate list (inside the loop, once per row)
        col0.append(split_line[0])
        col1.append(float(split_line[1]))
        col2.append(float(split_line[2]))
        col3.append(float(split_line[3]))
        col4.append(float(split_line[4]))



print("Correlation for col1 and col2: %.4f" %(Pearson(col1,col2)))
print("Correlation for col1 and col3: %.4f" %(Pearson(col1,col3)))
print("Correlation for col1 and col4: %.4f" %(Pearson(col1,col4)))
print("Correlation for col2 and col3: %.4f" %(Pearson(col2,col3)))
print("Correlation for col2 and col4: %.4f" %(Pearson(col2,col4)))
print("Correlation for col3 and col4: %.4f" %(Pearson(col3,col4)))
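The six print calls above list every pair of columns by hand. As an alternative sketch, `itertools.combinations` can generate the pairs. The `columns` dict and `results` dict are hypothetical names, and the three values per column are just the first three data rows from the question (Alabama, Alaska, Arizona), used only to exercise the loop, so the printed values are not the 49-state correlations.

```python
import math
from itertools import combinations

def Pearson(X, Y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(X)
    sum_xy = sum(x * y for x, y in zip(X, Y))
    sum_x, sum_y = sum(X), sum(Y)
    sum_x2 = sum(x * x for x in X)
    sum_y2 = sum(y * y for y in Y)
    return (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Illustrative subset only: first three rows of the question's data
columns = {
    'col1': [107.1, 89.0, 63.4],
    'col2': [24.9, 24.9, 18.6],
    'col3': [47.1, 54.2, 49.4],
    'col4': [321.1, 296.2, 248.9],
}

results = {}
# combinations(..., 2) yields each unordered pair of columns exactly once
for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
    results[(name_a, name_b)] = Pearson(a, b)
    print("Correlation for %s and %s: %.4f" % (name_a, name_b, results[(name_a, name_b)]))
```

With four columns this produces the same six pairs, in the same order, as the hand-written print statements.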
answered 2015-01-07T02:22:12.620