
I currently have a utilities.py file with this machine-learning function:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
import models
import random

words = [w.strip() for w in open('words.txt') if w == w.lower()]
def scramble(s):
    return "".join(random.sample(s, len(s)))

@models.db_session
def check_pronounceability(word):

    scrambled = [scramble(w) for w in words]

    X = words+scrambled
    y = ['word']*len(words) + ['unpronounceable']*len(scrambled)
    X_train, X_test, y_train, y_test = train_test_split(X, y)

    text_clf = Pipeline([
        ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
        ('clf', MultinomialNB())
        ])
    text_clf = text_clf.fit(X_train, y_train)
    stuff = text_clf.predict_proba([word])
    pronounceability = round(100*stuff[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability

Then I call it from my app.py:

from flask import Flask, render_template, jsonify, request
from rq import Queue
from rq.job import Job
from worker import conn
from flask_cors import CORS
from utilities import check_pronounceability

app = Flask(__name__)

q = Queue(connection=conn)

import models
@app.route('/api/word', methods=['POST', 'GET'])
@models.db_session
def check():
    if request.method == "POST":
        word = request.form['word']
        if not word:
            return render_template('index.html')
        db_word = models.Word.get(word=word)
        if not db_word:
            job = q.enqueue_call(check_pronounceability, args=(word,))
        return jsonify(job=job.id)

After reading the python-rq performance notes, which state:

One pattern you can use to improve the throughput performance of these kinds of jobs is to import the necessary modules before the fork.

I then made my worker.py file look like this:

import os

import redis
from rq import Worker, Queue, Connection

listen = ['default']

redis_url = os.getenv('REDISTOGO_URL', 'redis://localhost:6379')

conn = redis.from_url(redis_url)
import utilities

if __name__ == '__main__':
    with Connection(conn):
        worker = Worker(list(map(Queue, listen)))
        worker.work()

The problem I'm having is that this still runs slowly. Am I doing something wrong? Is there any way to make checking a word faster by keeping everything in memory? Based on my xpost in python-rq, it seems I'm importing it correctly.

1 Answer

I have a few suggestions:

  1. Before you start optimizing python-rq throughput, check where the bottleneck is. I would be surprised if the queue were the bottleneck rather than the check_pronounceability function.

  2. Make sure each call to check_pronounceability runs as fast as possible; forget about the queue, which is irrelevant at this stage.
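As a quick way to act on suggestion 1, the function can be timed in isolation before blaming the queue (a minimal stdlib sketch; the check_pronounceability stub below is a stand-in for the real function, and the sleep duration is made up):

```python
import time

def check_pronounceability(word):
    # stand-in for the real function; the sleep simulates the
    # per-call model-training cost
    time.sleep(0.05)
    return 42.0

start = time.perf_counter()
check_pronounceability('hello')
elapsed = time.perf_counter() - start
print(f'one call took {elapsed:.3f}s')
```

If one bare call already takes seconds, the queue is not your problem.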

To optimize check_pronounceability, I suggest you:

  1. Create the training data once, for all API calls.

  2. Forget train_test_split: you never use the test split, so why waste CPU cycles creating it?

  3. Train NaiveBayes once, for all API calls. The input is a single word to be classified as pronounceable or unpronounceable; there is no need to build a new model for every new word. Build one model and reuse it for every word. This also has the benefit of producing stable results, and makes it easier to swap out the model inside check_pronounceability.

The suggested changes are below:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelBinarizer
import models
import random

words = [w.strip() for w in open('words.txt') if w == w.lower()]
def scramble(s):
    return "".join(random.sample(s, len(s)))

scrambled = [scramble(w) for w in words]
X = words+scrambled
# explicitly create binary labels
label_binarizer = LabelBinarizer()
# ravel() flattens the (n, 1) column LabelBinarizer returns into the
# 1-D label array the classifier expects
y = label_binarizer.fit_transform(['word']*len(words) + ['unpronounceable']*len(scrambled)).ravel()

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB())
])
text_clf = text_clf.fit(X, y)
# you might want to persist the Pipeline to disk at this point to ensure it's not lost in case there is a crash

@models.db_session
def check_pronounceability(word):
    stuff = text_clf.predict_proba([word])
    pronounceability = round(100*stuff[0][1], 2)
    models.Word(word=word, pronounceability=pronounceability)
    models.commit()
    return pronounceability
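Following the comment above about persisting the Pipeline, here is a minimal sketch using the stdlib pickle module (the filename and the tiny stand-in training set are made up for illustration; real code would fit on words + scrambled):

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])
# tiny stand-in corpus: two real words, two scrambles
text_clf.fit(['hello', 'world', 'lohel', 'dworl'],
             ['word', 'word', 'unpronounceable', 'unpronounceable'])

with open('text_clf.pkl', 'wb') as f:   # persist the trained model
    pickle.dump(text_clf, f)

with open('text_clf.pkl', 'rb') as f:   # e.g. at worker start-up
    clf = pickle.load(f)
```

That way a crash or a worker restart loads the model from disk instead of retraining it.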

Final notes:

  • I'm assuming you have already done some cross-validation of the model elsewhere to confirm that it actually does a good job of predicting label probabilities; if you haven't, you should.

  • In general, NaiveBayes is not the best at producing reliable class-probability estimates; it tends to be either over-confident or over-timid (probabilities pushed close to 1 or 0). You should check this against your database. A LogisticRegression classifier should give you more reliable probability estimates. Since model training is no longer part of the API call, it no longer matters how long training takes.
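Swapping the classifier is a one-line change to the Pipeline (a sketch with LogisticRegression's default settings and a tiny stand-in corpus, not your real training data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('clf', LogisticRegression()),      # swapped in for MultinomialNB
])
text_clf.fit(['hello', 'world', 'lohel', 'dworl'],  # stand-in corpus
             ['word', 'word', 'unpronounceable', 'unpronounceable'])
proba = text_clf.predict_proba(['hello'])[0]
# predict_proba columns follow text_clf.classes_ (sorted labels),
# so index 1 is 'word' here
```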

Answered 2017-02-15T14:14:36.493