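I thought this would be a simple task, but I can't find a recipe for it anywhere online.

Could you suggest a JSON query, sent from Python to an Elasticsearch instance, that returns the frequency of a particular term in a particular field?

I suppose this should be possible with some tweaking of the Term Vectors API, but it doesn't seem straightforward.

I wouldn't mind getting both the absolute frequency and the number of documents containing the term.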

2 Answers

If you have the document IDs, you can use the Multi termvectors API: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-multi-termvectors.html

curl -X POST "localhost:9200/index/type/_mtermvectors?pretty" -H 'Content-Type: application/json' -d' 
{
    "ids" : ["your_document_id1","your_document_id2"],      
    "parameters": {
        "fields": [
                "your_field"       
        ],
        "term_statistics": true
    }
}
'
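Since the question asks for Python, here is a minimal sketch of the same request body built as a Python dict. The helper name `build_mtermvectors_body` is made up for illustration, and the commented-out client call assumes an elasticsearch-py version that still accepts a `body` argument; adjust it to your setup.

```python
import json


def build_mtermvectors_body(doc_ids, field):
    """Build the same request body as the curl example (hypothetical helper)."""
    return {
        "ids": list(doc_ids),
        "parameters": {
            "fields": [field],
            "term_statistics": True,
        },
    }


body = build_mtermvectors_body(
    ["your_document_id1", "your_document_id2"], "your_field"
)
print(json.dumps(body, indent=4))

# Sending it with the official client would look roughly like this
# (not executed here; host and index name are placeholders):
# from elasticsearch import Elasticsearch
# es = Elasticsearch(["localhost:9200"])
# response = es.mtermvectors(index="index", body=body)
```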

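You can even pass an artificial document containing the terms you want to analyze. As described here (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html), make sure term_statistics is set to true so that you get the following index-wide information:

  • Total term frequency (how often a term occurs across all documents)
  • Document frequency (the number of documents containing the current term)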
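
To make the two statistics concrete, here is a rough sketch of how they could be pulled out of a term vectors response. The helper `term_stats` and the numbers in `sample_response` are invented for illustration; only the key names (`term_vectors`, `terms`, `doc_freq`, `ttf`) follow the documented response layout.

```python
def term_stats(response, field, term):
    """Return (doc_freq, ttf) for `term` in `field`, or None if it is missing.

    Hypothetical helper: `response` is assumed to be the parsed JSON of a
    single term vectors response requested with term_statistics=true.
    """
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    entry = terms.get(term)
    if entry is None:
        return None
    return entry.get("doc_freq"), entry.get("ttf")


# Made-up response fragment, trimmed to the keys used above.
sample_response = {
    "term_vectors": {
        "your_field": {
            "terms": {
                "test": {"doc_freq": 2, "ttf": 4, "term_freq": 3},
            }
        }
    }
}

print(term_stats(sample_response, "your_field", "test"))     # (2, 4)
print(term_stats(sample_response, "your_field", "missing"))  # None
```

Keep in mind that the keys under `terms` are the analyzed tokens, so a term that the field's analyzer lowercases or stems has to be looked up in that form.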
answered 2020-02-11T17:09:21.287

There is actually a simple solution, as follows:

from elasticsearch import Elasticsearch as ES
from copy import deepcopy
import sys

# Usage: python script.py <field> <term> [<term> ...]
_field = sys.argv[1]
_terms = sys.argv[2:]

_timeout = 60
_gate    = 'some.gate.org/'
_index   = 'some_index'
_client  = ES([_gate],scheme='http',port=80,timeout=_timeout) #or however to get connection

# Request template: an artificial document whose only field will hold the term
_body= {"doc": {_field: None}, "term_statistics" : True, "field_statistics" : True, "positions": False, "offsets": False}

for term in _terms:
    body = deepcopy(_body)
    body["doc"][_field] = term
    result = _client.termvectors(index=_index,body=body)
    # The keys under 'terms' are analyzed tokens, so this lookup assumes
    # the field's analyzer leaves the term unchanged
    stats = result['term_vectors'][_field]['terms'][term]
    print('documents with', term, ':', stats['doc_freq'])
    print('frequency of  ', term, ':', stats['ttf'])
answered 2020-02-12T13:01:46.003