php - Improving recommendation speed in Neo4j

Question

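I am trying to build a simple recommendation engine with Neo4j and Reco4PHP.

The data model consists of the following nodes and relationships:

(User)-[:HAS_BOUGHT]->(Product {category_id: int})-[:DESIGNED_BY]->(Designer)

In this system I want to recommend products and boost those designed by the same designers as products the user has already bought. To create the recommendations I use one Discovery class and one Post-Processor class that boosts the products; see below. This works, but it is slow: it takes more than 5 seconds to complete, although the data model contains only about 1000 products and about 100 designers.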

// Discovery class
<?php
namespace App\Reco4PHP\Discovery;
use GraphAware\Common\Cypher\Statement;
use GraphAware\Common\Type\NodeInterface;
use GraphAware\Reco4PHP\Engine\SingleDiscoveryEngine;

class InCategory extends SingleDiscoveryEngine {

    protected $categoryId;

    public function __construct($categoryId) {
        $this->categoryId = $categoryId;
    }

    /**
     * @return string The name of the discovery engine
     */
    public function name() {
        return 'in_category';
    }

    /**
     * The statement to be executed for finding items to be recommended
     *
     * @param \GraphAware\Common\Type\NodeInterface $input
     * @return \GraphAware\Common\Cypher\Statement
     */
    public function discoveryQuery(NodeInterface $input) {

        $query = "
            MATCH (reco:Card)
            WHERE reco.category_id = {category_id}
            RETURN reco, 1 as score
        ";

        return Statement::create($query, ['category_id' => $this->categoryId]);
    }
}

// Boost shared designers
class RewardSharedDesigners extends RecommendationSetPostProcessor {

    public function buildQuery(NodeInterface $input, Recommendations $recommendations)
    {
        $ids = [];
        foreach ($recommendations->getItems() as $recommendation) {
            $ids[] = $recommendation->item()->identity();
        }

        $query = 'UNWIND {ids} as id
        MATCH (reco) WHERE id(reco) = id
        MATCH (user:User) WHERE id(user) = {userId}
        MATCH (user)-[:HAS_BOUGHT]->(product:Product)-[:DESIGNED_BY]->()<-[:DESIGNED_BY]-(reco)

        RETURN id, count(product) as sharedDesignedBy';

        return Statement::create($query, ['ids' => $ids, 'userId' => $input->identity()]);
    }

    public function postProcess(Node $input, Recommendation $recommendation, Record $record) {
        $recommendation->addScore($this->name(), new SingleScore((int)$record->get('sharedDesignedBy')));
    }

    public function name() {
        return 'reward_shared_designers';
    }
}

I am happy that it works, but if a computation takes more than 5 seconds it is not usable in production.

To improve the speed, I have tried the following:

Created indexes on Product:id and Designer:id.
Added node_auto_indexing=true to neo4j.properties.
Added -Xmx4096m to .neo4j-community.vmoptions.

None of this made a real difference.

Is it normal for these Cypher queries to take more than 5 seconds, or is there room for improvement? :)

score 2 · Accepted Answer

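The main problem is your post-processor query. The goal is:

Boost a recommendation based on the number of products I have bought from the designer who designed the recommended item.

So you can modify your query slightly to match the designer directly and aggregate on it. It is also better to find the user before the UNWIND; otherwise the user is matched again on every iteration over the product ids: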

MATCH (user) WHERE id(user) = {userId}
UNWIND {ids} as productId
MATCH (product:Product)-[:DESIGNED_BY]->(designer)
WHERE id(product) = productId
WITH productId, designer, user
MATCH (user)-[:HAS_BOUGHT]->(p)-[:DESIGNED_BY]->(designer)
RETURN productId as id, count(*) as score

The complete post-processor looks like this:

    public function buildQuery(NodeInterface $input, Recommendations $recommendations)
    {
        $ids = [];
        foreach ($recommendations->getItems() as $recommendation) {
            $ids[] = $recommendation->item()->identity();
        }

        $query = 'MATCH (user) WHERE id(user) = {userId}
        UNWIND {ids} as productId
        MATCH (product:Product)-[:DESIGNED_BY]->(designer)
        WHERE id(product) = productId
        WITH productId, designer, user
        MATCH (user)-[:HAS_BOUGHT]->(p)-[:DESIGNED_BY]->(designer)
        RETURN productId as id, count(*) as score';

        return Statement::create($query, ['userId' => $input->identity(), 'ids' => $ids]);
    }

    public function postProcess(Node $input, Recommendation $recommendation, Record $record)
    {
        $recommendation->addScore($this->name(), new SingleScore($record->get('score')));
    }

I created a repository containing a fully functional implementation that follows your domain:

https://github.com/ikwattro/reco4php-example-so

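Update after receiving the data

The fact that you have multiple relationships of the same type between a product and a user is multiplying the number of patterns found.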
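
One way to confirm this on a given dataset is to count parallel relationships per user/card pair. This is a minimal sketch, assuming the `User` and `Card` labels and the `HAS_BOUGHT` relationship type used in the queries here; the `LIMIT` is an arbitrary choice:

```cypher
// Sketch: list user/card pairs linked by more than one HAS_BOUGHT relationship.
// Each extra relationship yields an extra match of the full pattern.
MATCH (u:User)-[r:HAS_BOUGHT]->(c:Card)
WITH u, c, count(r) AS purchases
WHERE purchases > 1
RETURN id(u) AS userId, id(c) AS cardId, purchases
ORDER BY purchases DESC
LIMIT 20
```

If this returns many rows, the long pattern in the post-processor is being expanded once per duplicate relationship.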

There are two solutions:

Separate them and add a WHERE clause for the end of the pattern:

MATCH (user) WHERE id(user) = {userId}
UNWIND {ids} as cardId
MATCH (reco:Card)-[:DESIGNED_BY]->(designer) WHERE id(reco) = cardId
MATCH (user)-[:HAS_BOUGHT]->(x)
WHERE (x)-[:DESIGNED_BY]->(designer)
RETURN cardId as id, count(*) as sharedDesignedBy

In Neo4j 3.0+, you can benefit from the USING JOIN hint and keep the same query as yours:

MATCH (user) WHERE user.id = 245
UNWIND {ids} as id
MATCH (reco:Card) WHERE id(reco) = id
MATCH (user:User)-[:HAS_BOUGHT]->(card:Card)-[:DESIGNED_BY]->(designer:Designer)<-[:DESIGNED_BY]-(reco:Card)
USING JOIN ON card
RETURN id, count(card) as sharedDesignedBy

Running these queries, I brought the discovery + post-processing time down to 190 ms with your current dataset.
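
To see where the time goes in a query like these, it can be prefixed with `PROFILE`, which runs it and reports rows and db hits per operator. The sketch below substitutes literal values for the parameters; the user id 245 comes from the query above, while the card ids in the list are hypothetical placeholders:

```cypher
// Sketch: profile the WHERE-clause variant with literal inputs.
// The card ids [1, 2, 3] are placeholders, not values from the real dataset.
PROFILE
MATCH (user:User) WHERE user.id = 245
UNWIND [1, 2, 3] AS cardId
MATCH (reco:Card)-[:DESIGNED_BY]->(designer) WHERE id(reco) = cardId
MATCH (user)-[:HAS_BOUGHT]->(x)
WHERE (x)-[:DESIGNED_BY]->(designer)
RETURN cardId AS id, count(*) AS sharedDesignedBy
```

Operators with a disproportionate number of db hits are the ones worth rewriting first.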

score 0 · Answer

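I can only comment on the Cypher, and even then not much, since you did not include the GetItems() function or the data (a Cypher dump). But a few things stand out:

Using a label on (reco) would be faster; I assume it is a Product?
I also assume a Designer label could be put on the anonymous node in -[:DESIGNED_BY]->()<-[:DESIGNED_BY]-?
If by any chance GetItems() retrieves the items one by one, that may be the problem, and that is where an index would be needed. By the way, why not put that condition in the main query?

I also do not understand the indexes on id. If they are Neo4j ids, they are physical locations and need no index; if they are not, why are you using the id() function?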
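
The distinction can be sketched as follows. A lookup through `id()` goes straight to the node and involves no schema index, whereas a filter on a property such as `category_id` (the one used by the discovery query in the question) is what an index can serve. The statements assume the Neo4j 2.x/3.x index syntax that matches the `{param}` placeholders used in this thread, and the literal values are placeholders:

```cypher
// Lookup by internal node id: no schema index is involved.
MATCH (n) WHERE id(n) = 42 RETURN n;

// Lookup by property: an index on the filtered property can be used.
// Legacy (2.x/3.x) syntax; run as a separate statement before querying.
CREATE INDEX ON :Card(category_id);

MATCH (c:Card) WHERE c.category_id = 7 RETURN c;
```

With only about 1000 products the gain from such an index may be small.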

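In summary, labels may help, but do not expect miracles if your dataset is large; aggregation in Neo4j is not super fast. Counting 10M records without a filter took me 12 seconds.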
