I am trying to follow this example to perform linear discriminant analysis and principal component analysis with scikit-learn, using my own data. My data looks like:
id,mois,prot,fat,ash,sodium,carb,cal,brand
14069,27.82,21.43,44.87,5.11,1.77,0.77,4.93,a
14053,28.49,21.26,43.89,5.34,1.79,1.02,4.84,a
14025,28.35,19.99,45.78,5.08,1.63,0.8,4.95,a
14016,30.55,20.15,43.13,4.79,1.61,1.38,4.74,a
14005,30.49,21.28,41.65,4.82,1.64,1.76,4.67,a
14075,31.14,20.23,42.31,4.92,1.65,1.4,4.67,a
14082,31.21,20.97,41.34,4.71,1.58,1.77,4.63,a
14097,28.76,21.41,41.6,5.28,1.75,2.95,4.72,a
14117,28.22,20.48,45.1,5.02,1.71,1.18,4.93,a
14133,27.72,21.19,45.29,5.16,1.66,0.64,4.95,a
...
`brand` is the target variable.
Following the example linked above, I started with the following code:
# Import libraries
%pylab inline
import pylab as pl
from sklearn import datasets
from sklearn.decomposition import PCA
# sklearn.lda was removed in newer scikit-learn versions;
# LDA now lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
import pandas as pd

# Set up the data for the example
# raw string so "\m" and "\p" are not treated as escape sequences
pizza_raw = pd.read_csv(r"C:\mypath\pizza.csv")
pizza_target = pizza_raw["brand"]
# select the numeric feature columns: everything between id and brand
# (.ix is deprecated; .iloc is the positional indexer, and the id column
# should not be fed into PCA/LDA as a feature)
pizza_data = pizza_raw.iloc[:, 1:-1]
pizza_names = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]

# Principal Components
pca = PCA(n_components=2)
X_r = pca.fit(pizza_data).transform(pizza_data)

# Linear Discriminant Analysis
lda = LDA(n_components=2)
X_r2 = lda.fit(pizza_data, pizza_target).transform(pizza_data)

# Percentage of variance explained for each component
print('PCA explained variance ratio (first two components): %s'
      % str(pca.explained_variance_ratio_))
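As a quick sanity check (not part of the tutorial, and using random stand-in data rather than my pizza.csv), I convinced myself that the `fit(...).transform(...)` pattern from the example gives the same result as `fit_transform(...)`:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in data: 20 samples, 5 features
rng = np.random.RandomState(0)
X = rng.rand(20, 5)

pca = PCA(n_components=2)
a = pca.fit(X).transform(X)          # the tutorial's two-step pattern
b = PCA(n_components=2).fit_transform(X)  # the one-step equivalent

# both routes produce the same 2-D projection
assert np.allclose(a, b)
```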
All of the above works as expected (I think). The next step in the example is to plot the data. (The example uses the IRIS dataset...) The example code looks like this:
pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r[y == i, 0], X_r[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('PCA of IRIS dataset')

pl.figure()
for c, i, target_name in zip("rgb", [0, 1, 2], target_names):
    pl.scatter(X_r2[y == i, 0], X_r2[y == i, 1], c=c, label=target_name)
pl.legend()
pl.title('LDA of IRIS dataset')

pl.show()
So, two questions:
- Is the way I have fitted my data to the tutorial's approach correct so far?
- How do I adjust the example plotting code to produce the same PCA and LDA plots for my data?
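For the second question, here is my attempt at adapting the loop. The example's `y` and `target_names` are integer labels and iris class names, while my `brand` column holds strings, so I compare `y` against the label values themselves. (The DataFrame below is a small synthetic stand-in for my pizza.csv, since I can't share the file; the column names are just illustrative.)

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# synthetic stand-in for pizza.csv: 30 rows, 3 numeric features, 3 brands
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(30, 3), columns=["mois", "prot", "fat"])
df["brand"] = ["a", "b", "c"] * 10

X = df.iloc[:, :-1]
y = df["brand"].values        # string labels instead of the iris 0/1/2
target_names = np.unique(y)   # the distinct brand values

X_r = PCA(n_components=2).fit_transform(X)

plt.figure()
# zip the colors with the label values themselves, and mask rows by
# comparing y against the string label rather than an integer index
for c, target_name in zip("rgb", target_names):
    plt.scatter(X_r[y == target_name, 0], X_r[y == target_name, 1],
                c=c, label=target_name)
plt.legend()
plt.title("PCA of pizza dataset")
```

Is this the right way to translate the example's integer-indexed loop to string labels, or is there a more idiomatic approach?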