
I am doing some analytics with Solr, specifically using its faceting and pivot functionality on a large set of log files. I have indexed a large log file in Solr along the lines of:

       Keyword   Visits  log_date_ISO
    1  red       1,938   2013-01-01
    2  blue      435     2013-02-01
    3  green     318     2013-04-01
    4  red blue  279     2013-01-01

I then run a query and facet by 'log_date_ISO' to give me, per date, counts of the keywords that contain the query term. Two questions:

(1) Is there a way to sum the visits per keyword for each date - because what I really want is to sum visits across keywords that contain the query:

-> e.g. if I ran the query 'red' on the above, I would want the date 2013-01-01 to have a count of 1938 + 279 = 2217 (i.e. the sum of the visits associated with the keywords containing the query 'red') rather than 2 (i.e. the count of the keywords containing the query).

(2) Is there a way to normalise by monthly query volume?

-> e.g. if the query volume for '2013-01-01' was 10,000 then the normalised volume for the query 'red' would be 2217/10000 = 0.2217

LAST RESORT: If these are not possible, I will pre-process the log file using pandas/python to group by date, then by keyword then normalise - but was wondering if it was possible in Solr.

Thanks in advance.


2 Answers


Here is one way to do it (similar to Dan Allen's answer here):

In [11]: keywords = df.pop('Keyword').apply(lambda x: pd.Series(x.split())).stack()

In [12]: keywords.index = keywords.index.droplevel(-1)

In [13]: keywords.name = 'Keyword'

In [14]: df1 = df.join(keywords)

In [15]: df1
Out[15]:
   Visits  log_date_ISO  Keyword
1    1938    2013-01-01      red
2     435    2013-02-01     blue
3     318    2013-04-01    green
4     279    2013-01-01      red
4     279    2013-01-01     blue

Then you can do the relevant groupby:

In [16]: df1.groupby(['log_date_ISO', 'Keyword']).sum()
Out[16]:
                        Visits
log_date_ISO  Keyword
2013-01-01    blue         279
              red         2217
2013-02-01    blue         435
2013-04-01    green        318

To get the percentage of visits (and avoid double counting), I would first do a transform:

df['VisitsPercentage'] = df.groupby('log_date_ISO')['Visits'].transform(lambda x: x / x.sum())

# follow the same steps as above

In [21]: df2 = df.join(keywords)

In [22]: df2
Out[22]:
   Visits  log_date_ISO  VisitsPercentage  Keyword
1    1938    2013-01-01          0.874154      red
2     435    2013-02-01          1.000000     blue
3     318    2013-04-01          1.000000    green
4     279    2013-01-01          0.125846      red
4     279    2013-01-01          0.125846     blue
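
The steps above can be collected into one self-contained script; a sketch, using the sample data from the question (the normalisation is done *before* splitting the multi-word keywords, so each date's visits sum to 1 and multi-keyword rows are not double counted):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame(
    {'Keyword': ['red', 'blue', 'green', 'red blue'],
     'Visits': [1938, 435, 318, 279],
     'log_date_ISO': ['2013-01-01', '2013-02-01', '2013-04-01', '2013-01-01']},
    index=[1, 2, 3, 4])

# normalise Visits within each date before splitting keywords
df['VisitsPercentage'] = (df.groupby('log_date_ISO')['Visits']
                            .transform(lambda x: x / x.sum()))

# split multi-word keywords into one row per keyword
keywords = df.pop('Keyword').apply(lambda x: pd.Series(x.split())).stack()
keywords.index = keywords.index.droplevel(-1)
keywords.name = 'Keyword'

df2 = df.join(keywords)
print(df2)
print(df2.groupby(['log_date_ISO', 'Keyword'])['Visits'].sum())
```

For the query 'red' on 2013-01-01 this yields 1938 + 279 = 2217, as asked for in the question.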
answered 2013-09-02T21:39:34.250

It is possible to use Solr to group records by one field and, per group, sum another field of those records, using

(1) Facets/pivots (which group the data by a specified field)

(2) The StatsComponent (which computes field statistics for a specified field, including the sum)

The call I made was (where, unlike the names used in the question, the 'Keyword' field is called 'q_string', 'Visits' above is called 'q_visits' and 'log_date_ISO' is called 'q_date'):

http://localhost:8983/solr/select?q=neuron&stats=true&stats.field=q_visits&rows=1&indent=true&stats.facet=q_date
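
For reference, the same request can be assembled programmatically; a minimal sketch with the standard library (the host, handler, and field names are taken from the call above):

```python
from urllib.parse import urlencode

params = [
    ('q', 'neuron'),              # query term
    ('stats', 'true'),            # enable the StatsComponent
    ('stats.field', 'q_visits'),  # numeric field to aggregate
    ('stats.facet', 'q_date'),    # break the stats down by date
    ('rows', '1'),
    ('indent', 'true'),
]
url = 'http://localhost:8983/solr/select?' + urlencode(params)
print(url)
```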

This provides basic statistics, including the sum, for the *q_visits* field broken down by date; the specific value I was interested in is the sum:

<double name="min">1.0</double>
<double name="max">435.0</double>
<long name="count">263</long>
<long name="missing">0</long>
<double name="sum">845.0</double>
<double name="sumOfSquares">192917.0</double>
<double name="mean">3.2129277566539924</double>
<double name="stddev">26.94368427501248</double>
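
If the XML response is consumed from a script, the sum can be pulled out with the standard library; a sketch, assuming the elements shown above arrive wrapped in an enclosing stats element (the `<lst name="stats">` wrapper here is illustrative):

```python
import xml.etree.ElementTree as ET

# the stats fragment from the response above, wrapped in a root element
xml_fragment = """<lst name="stats">
  <double name="min">1.0</double>
  <double name="max">435.0</double>
  <long name="count">263</long>
  <long name="missing">0</long>
  <double name="sum">845.0</double>
  <double name="sumOfSquares">192917.0</double>
  <double name="mean">3.2129277566539924</double>
  <double name="stddev">26.94368427501248</double>
</lst>"""

root = ET.fromstring(xml_fragment)
total = float(root.find("double[@name='sum']").text)
print(total)  # 845.0
```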

The field for which the statistics are collected is declared as type float in schema.xml (if it is declared as a string, sum, stddev and mean are not shown).
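
As a sketch, the declaration in schema.xml would look something like this (the field name matches the call above; the `float` type must map to a float field type defined elsewhere in the schema):

```xml
<field name="q_visits" type="float" indexed="true" stored="true"/>
```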

answered 2013-09-03T19:49:12.070