Hello,
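The code below runs in a Docker container based on jupyter's data science notebook; in addition, I have installed Java 8 and h2o (version 3.20.0.7) and exposed the necessary ports. The Docker container runs on an Ubuntu 16.04 system with 32 threads and over 300 GB of RAM.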
h2o is using all threads and 26.67 GB of memory. I am trying to classify text as 0 or 1 using the code below.
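However, even though max_runtime_secs is set to 900 (15 minutes), the code has not finished executing and is still using most of the machine's resources roughly 15 hours later. As a side note, parsing df_train alone takes about 20 minutes. Any ideas about what is going wrong?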
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
df = pd.read_csv('Data.csv')[['Text', 'Classification']]
vectorizer = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}',
                             ngram_range=(1, 3), stop_words='english')
x_train_vec = vectorizer.fit_transform(df['Text'])
y_train = df['Classification']
import h2o
from h2o.automl import H2OAutoML
h2o.init()
# .A converts the sparse count matrix to a dense array before it is uploaded to H2O
df_train = h2o.H2OFrame(x_train_vec.A, header=-1, column_names=vectorizer.get_feature_names())
df_labels = h2o.H2OFrame(y_train.reset_index()[['Classification']])
df_train = df_train.concat(df_labels)
x_train_cn = df_train.columns
y_train_cn = 'Classification'
x_train_cn.remove(y_train_cn)
df_train[y_train_cn] = df_train[y_train_cn].asfactor()
h2o_aml = H2OAutoML(max_runtime_secs=900, exclude_algos=["DeepLearning"])
h2o_aml.train(x=x_train_cn, y=y_train_cn, training_frame=df_train)
lb = h2o_aml.leaderboard
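# predict() returns an H2OFrame (predicted label plus class probabilities);
# keep only the 'predict' column as a pandas Series for the scikit-learn metrics
y_predict = h2o_aml.leader.predict(df_train.drop('Classification'))['predict'].as_data_frame()['predict']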
print('accuracy: {}'.format(accuracy_score(y_pred=y_predict, y_true=y_train)))
print('precision: {}'.format(precision_score(y_pred=y_predict, y_true=y_train)))
print('recall: {}'.format(recall_score(y_pred=y_predict, y_true=y_train)))
print('f1: {}\n'.format(f1_score(y_pred=y_predict, y_true=y_train)))
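
For context, here is a rough sketch of how one might estimate the memory needed once the sparse count matrix is converted to a dense array with `.A`, compared with keeping it sparse. The helper names `dense_size_gib` and `sparse_size_gib` are made up for illustration, and the shape below is hypothetical, not measured on Data.csv; the real values would come from `x_train_vec.shape` and `x_train_vec.nnz`. It assumes 8-byte counts (the CountVectorizer default dtype is int64) and 4-byte CSR indices.

```python
def dense_size_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate memory, in GiB, of a dense n_rows x n_cols numeric array."""
    return n_rows * n_cols * bytes_per_value / 2**30


def sparse_size_gib(n_nonzero, n_rows, bytes_per_value=8, bytes_per_index=4):
    """Approximate memory, in GiB, of the same data stored in CSR form.

    CSR keeps one value and one column index per non-zero entry,
    plus one row pointer per row (and one extra).
    """
    data = n_nonzero * bytes_per_value
    indices = n_nonzero * bytes_per_index
    indptr = (n_rows + 1) * bytes_per_index
    return (data + indices + indptr) / 2**30


# Hypothetical shape, NOT taken from Data.csv: 50,000 documents,
# 2,000,000 distinct 1- to 3-grams, about 100 non-zero counts per document.
n_rows, n_cols = 50_000, 2_000_000
n_nonzero = n_rows * 100

print('dense : {:.1f} GiB'.format(dense_size_gib(n_rows, n_cols)))
print('sparse: {:.3f} GiB'.format(sparse_size_gib(n_nonzero, n_rows)))
```

With the real shape plugged in, comparing the two figures shows how much of the parse time and memory use could be attributed to the dense conversion alone, separately from anything AutoML does afterwards.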