有很多方法可以可视化数据集。我想在这里将所有这些方法放在一起,并为此选择iris
了数据集。为了做到这一点,这些被写在这里。
我会使用pandas
“可视化”或seaborn
“.
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates
import pandas as pd
# Parallel Coordinates
# Load the data set
iris = sns.load_dataset("iris")
parallel_coordinates(iris, 'species', color=('#556270', '#4ECDC4', '#C7F464'))
plt.show()
结果如下:
from pandas.plotting import andrews_curves
# Andrew Curves
a_c = andrews_curves(iris, 'species')
a_c.plot()
plt.show()
from seaborn import pairplot
# Pair Plot
pairplot(iris, hue='species')
plt.show()
还有另一个我认为使用最少和最重要的情节是以下情节:
from plotly.express import scatter_3d
# Plotting in 3D by plotly.express that would show the plot with capability of zooming,
# changing the orientation, and rotating
scatter_3d(iris, x='sepal_length', y='sepal_width', z='petal_length', size="petal_width",
color="species", color_discrete_map={"Joly": "blue", "Bergeron": "violet", "Coderre": "pink"})\
.show()
这会在您的浏览器中绘制并需要 HTML5,您可以随心所欲地查看它。下一个数字就是那个。请记住,这是一个散点图,每个球的大小都显示了四个特征的数据,petal_width
因此所有四个特征都在一个图中。
朴素贝叶斯是一种用于二分类(二分类)和多分类问题的分类算法。它被称为朴素贝叶斯,因为每个类别的概率计算都被简化以使其计算易于处理。与其尝试计算每个属性值的概率,不如假定它们在给定类值的情况下是条件独立的。这是一个非常强的假设,在真实数据中是最不可能的,即属性不交互。然而,该方法在这种假设不成立的数据上表现得非常好。
这是开发模型来预测该数据集的标签的一个很好的例子。您可以使用此示例来开发每个模型,因为这是它的基础。
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import seaborn as sns
# Load the data set
iris = sns.load_dataset("iris")
iris = iris.rename(index=str, columns={'sepal_length': '1_sepal_length', 'sepal_width': '2_sepal_width',
'petal_length': '3_petal_length', 'petal_width': '4_petal_width'})
# Setup X and y data
X_data_plot = df1.iloc[:, 0:2]
y_labels_plot = df1.iloc[:, 2].replace({'setosa': 0, 'versicolor': 1, 'virginica': 2}).copy()
x_train, x_test, y_train, y_test = train_test_split(df2.iloc[:, 0:4], y_labels_plot, test_size=0.25,
random_state=42) # This is for the model
# Fit model
model_sk_plot = GaussianNB(priors=None)
nb_model = GaussianNB(priors=None)
model_sk_plot.fit(X_data_plot, y_labels_plot)
nb_model.fit(x_train, y_train)
# Our 2-dimensional classifier will be over variables X and Y
N_plot = 100
X_plot = np.linspace(4, 8, N_plot)
Y_plot = np.linspace(1.5, 5, N_plot)
X_plot, Y_plot = np.meshgrid(X_plot, Y_plot)
plot = sns.FacetGrid(iris, hue="species", size=5, palette='husl').map(plt.scatter, "1_sepal_length",
"2_sepal_width", ).add_legend()
my_ax = plot.ax
# Computing the predicted class function for each value on the grid
zz = np.array([model_sk_plot.predict([[xx, yy]])[0] for xx, yy in zip(np.ravel(X_plot), np.ravel(Y_plot))])
# Reshaping the predicted class into the meshgrid shape
Z = zz.reshape(X_plot.shape)
# Plot the filled and boundary contours
my_ax.contourf(X_plot, Y_plot, Z, 2, alpha=.1, colors=('blue', 'green', 'red'))
my_ax.contour(X_plot, Y_plot, Z, 2, alpha=1, colors=('blue', 'green', 'red'))
# Add axis and title
my_ax.set_xlabel('Sepal length')
my_ax.set_ylabel('Sepal width')
my_ax.set_title('Gaussian Naive Bayes decision boundaries')
plt.show()
添加您认为必要的任何内容,例如 3d 中的决策边界是我以前没有做过的。