python - Weighted median computation with NumPy or SciPy

Question

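I'm trying to automate a process that JMP does (Analyze -> Distribution, entering column A as the "Y value" and using subsequent columns as the "weight" values). In JMP you have to do this one column at a time. I'd like to use Python to loop through all of the columns and create an array showing, say, the median of each column.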

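For example, if the mass array is [0, 10, 20, 30] and the weight array for column 1 is [30, 191, 9, 0], the weighted median of the mass array should be 10. However, I'm not sure how to arrive at this answer.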

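So far I've

imported the csv, showing the weights as an array with values of 0 masked, and
created an array of "Y values" with the same shape and size as the weights array (113x32). I'm not entirely sure I need to do this, but thought it would be easier than a for loop for the purposes of weighting.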

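I'm not sure where to go from here. Basically, the "Y value" is a range of masses, and every column in the array represents the number of data points found for each mass. I need to find the median mass based on the reported frequencies.

I'm not an expert in Python or statistics, so if I've left out any details that would be useful, let me know!

Update: here's some code for what I've done so far: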

#Boilerplate & Import files
import csv
import scipy as sp
from scipy import stats
from scipy.stats import norm
import numpy as np
from numpy import genfromtxt
import pandas as pd
import matplotlib.pyplot as plt

inputFile = '/Users/cl/prov.csv'
origArray = genfromtxt(inputFile, delimiter = ",")
nArray = np.array(origArray)
dimensions = nArray.shape
shape = np.asarray(dimensions)

#Mask values ==0
maTest = np.ma.masked_equal(nArray,0)

#Create array of masses the same shape as the weights (nArray)
fieldLength = shape[0]
rowLength = shape[1]

massArr = []  # one copy of the mass vector per weight column
for i in range(rowLength):
    massArr.append(np.arange(0, fieldLength*10, 10))
nmassArr = np.array(massArr).transpose()

score 7 · Accepted Answer

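If I understand your problem correctly, what we can do is sum up the observations; dividing that total by 2 gives us the observation number that corresponds to the median. From there we need to figure out which observation that number is.

One trick here is to calculate the observation sums with np.cumsum, which gives us a running cumulative sum.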

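Example:
np.cumsum([1,2,3,4]) -> [ 1, 3, 6, 10]
Each element is the sum of all previous elements and itself. We have 10 observations here, so the median would be the 5th observation. (We get 5 by dividing the last element by 2.)
Now, looking at the cumsum result, we can easily see that this must be the observation between the second and third elements (cumulative sums 3 and 6).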

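So all we need to do is figure out the index where the median position (5) fits.
np.searchsorted does exactly what we need. It finds the index at which an element should be inserted into an array so that the array stays sorted.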
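
For the small example above, a minimal sketch of that lookup could look like this (the variable names are only for illustration):

```python
import numpy as np

counts = np.array([1, 2, 3, 4])
running = np.cumsum(counts)           # running totals: 1, 3, 6, 10
half = running[-1] / 2.0              # 5.0, the position of the median observation
idx = np.searchsorted(running, half)  # first index whose running total reaches 5.0
print(idx)                            # 2, i.e. the third element
```

With the default side='left', searchsorted returns the first index whose running total is greater than or equal to the value being looked up.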

The code to do this for several rows of counts looks like this:

import numpy as np
# my test data: one row of frequency counts per data set
freq_count = np.array([[30, 191, 9, 0], [10, 20, 300, 10], [10,20,30,40], [100,10,10,10], [1,1,1,100]])

c = np.cumsum(freq_count, axis=1)
# for each row, find where half of the total count falls in the running sum
indices = [np.searchsorted(row, row[-1]/2.0) for row in c]
masses = np.array(indices) * 10  # Correct if the masses are indeed 0, 10, 20,...

# This is just for explanation.
print("median masses is:", masses)
print(freq_count)
print(np.hstack((c, c[:, -1, np.newaxis]/2.0)))

The output will be:

median masses is: [10 20 20  0 30]  
[[ 30 191   9   0]  <- The test data
 [ 10  20 300  10]  
 [ 10  20  30  40]  
 [100  10  10  10]  
 [  1   1   1 100]]  
[[  30.   221.   230.   230.   115. ]  <- cumsum results with median added to the end.
 [  10.    30.   330.   340.   170. ]     you can see from this where they fit in.
 [  10.    30.    60.   100.    50. ]  
 [ 100.   110.   120.   130.    65. ]  
 [   1.     2.     3.   103.    51.5]]
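
If the masses are not exactly 0, 10, 20, ..., the same idea can be wrapped in a small function that takes the mass values explicitly. This is only a sketch: weighted_median is a hypothetical helper, not a NumPy or SciPy function, and it assumes non-negative weights with a positive total.

```python
import numpy as np

def weighted_median(values, weights):
    """Return the value at which the running weight first reaches half the total.

    Hypothetical helper for illustration; not part of NumPy or SciPy.
    """
    values = np.asarray(values)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)             # the running sum must follow ascending values
    values, weights = values[order], weights[order]
    running = np.cumsum(weights)
    idx = np.searchsorted(running, running[-1] / 2.0)
    return values[idx]

masses = [0, 10, 20, 30]
print(weighted_median(masses, [30, 191, 9, 0]))   # 10
```

To process a whole table of weights, the function could be called once per column, for example with a list comprehension over the transposed weight array.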

score 3 · Accepted Answer

wquantiles is a small Python package that does exactly what you need. It just uses np.cumsum() and np.interp() under the hood.
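
To give an idea of how np.cumsum() and np.interp() can be combined for this, here is a rough sketch. It is not the package's actual code: the function name and the choice to place each value at the midpoint of its share of the cumulative weight are assumptions made for illustration.

```python
import numpy as np

def interpolated_weighted_quantile(values, weights, q):
    """Interpolated weighted quantile (illustrative sketch, hypothetical name)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # Assumption: each value sits at the midpoint of its slice of the total weight.
    positions = (np.cumsum(weights) - 0.5 * weights) / np.sum(weights)
    # Linearly interpolate the value at cumulative fraction q.
    return np.interp(q, positions, values)

print(interpolated_weighted_quantile([0, 10, 20, 30], [1, 1, 1, 1], 0.5))  # 15.0
```

Unlike the searchsorted approach, this interpolates between neighbouring values, so the result does not have to be one of the input masses.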

score 1 · Accepted Answer

Sharing some code that I have on hand. It lets you run statistics on each column of an Excel spreadsheet.

import itertools

import numpy as np
import xlrd  # note: xlrd 2.0 and later read only .xls files

book = xlrd.open_workbook('/filepath/workbook.xlsx')
sh = book.sheet_by_name("Sheet1")
ofile = '/outputfilepath/workbook.csv'

masses = sh.col_values(0, start_rowx=1)  # first column has mass
ages = sh.row_values(0, start_colx=1)   # first row has age ranges

# read one column of counts per age range
age = []
count = 1
for a in ages:
    age.append(sh.col_values(count, start_rowx=1))
    count += 1

stats = []
count = 0
for a in ages:
    # pair each mass with its count for this age range
    age_mass = zip(masses, age[count])
    count += 1
    # replicate element[0] for element[1] times
    expanded = list(list(itertools.repeat(am[0], int(am[1]))) for am in age_mass)

    #  separate into one big list
    medianlist = [x for t in expanded for x in t]

    # convert to array and mask out zeroes
    npa = np.array(medianlist)
    npa = np.ma.masked_equal(npa,0)

    median = np.ma.median(npa)  # np.ma.median honours the mask
    meanMass = np.average(npa)
    maxMass = np.max(npa)
    minMass = np.min(npa)
    stdev = np.std(npa)

    stats1 = [median, meanMass, maxMass, minMass, stdev]
    print(stats1)

    stats.append(stats1)

np.savetxt(ofile, (stats), fmt="%d")
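
The expansion step can also be written with np.repeat, which repeats each mass as many times as its count. A minimal sketch using the example values from the question; it assumes the counts are non-negative integers and skips the zero masking done above:

```python
import numpy as np

masses = np.array([0, 10, 20, 30])
counts = np.array([30, 191, 9, 0])

# repeat each mass by its count, giving one entry per observation
expanded = np.repeat(masses, counts)
print(len(expanded))         # 230
print(np.median(expanded))   # 10.0
```

This builds the full list of observations in memory, so it is best suited to modest counts.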
