I store a document-term matrix in MySQL and want to get results for queries like this:

Example: Get all rows where token_id 1 and token_id 2 (possibly more than two tokens) occur within a range of 10 words of each other.

My table:

dt_matrix_token_id int(11) PK AUTO_INCREMENT,
token_id int(11),
storage_data_id int(11),
position int(11)

So token_id identifies the token, and position is the position of that token in the original text.

Selecting rows by token_id is not the problem. The problem is how to express in a query that both words must lie within a specific "radius/range".

Select * FROM dt_matrix_token WHERE token_id IN(1,2) AND ???

??? is where I am stuck: how can I make the condition depend on the rows already found? For example, if the result contains a row with position = 12, all other valid rows should have position >= 2 AND position <= 22.

BTW: Could it be similar to a geolocation query within a radius?

Edit: Here's my current progress with sample data: http://sqlfiddle.com/#!2/52f48/2

The query works, but it is not complete yet: if token 1 matches twice in the document, that also counts as a "valid" result, which is of course wrong. A result is only correct when all of the given tokens are present, and the solution must be extendable to 3 or more tokens.

1 Answer

I would start with a query that joins the dt_matrix_token table to a second instance of itself, where both instances have a token_id from the set of values you are interested in, but the two instances can't have the same token_id.

They should also have a matching storage_data_id (i.e. they're in the same document), and the position of the second token must be greater than or equal to the first.

SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
  mt1.token_id AS token_id1, mt2.token_id AS token_id2,
  mt1.position AS position1, mt2.position AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
WHERE mt1.token_id IN (1,2,3) 
  AND mt2.token_id IN (1,2,3)
  AND mt1.token_id <> mt2.token_id
  AND mt1.storage_data_id = mt2.storage_data_id
  AND mt2.position >= mt1.position 

This gives you every ordered pair of tokens that you care about: two different tokens from your list, in the same document, with the second at or after the first.
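To see what this pairing join returns, here is a minimal sketch that runs it from Python against an in-memory SQLite database. SQLite is only a stand-in for MySQL here, the sample rows are made up, and the helper name `ordered_pairs` is hypothetical. The join conditions are written in an ON clause, which should be equivalent to the WHERE form above for an inner join.

```python
import sqlite3

# Hypothetical sample rows:
# (dt_matrix_token_id, token_id, storage_data_id, position)
SAMPLE_ROWS = [
    (1, 1, 1, 3),    # token 1 at position 3 in document 1
    (2, 2, 1, 5),    # token 2 at position 5 in document 1
    (3, 1, 1, 30),   # token 1 again at position 30 in document 1
]


def ordered_pairs(rows, token_ids):
    """Return (token_id1, position1, token_id2, position2) for every pair of
    different tokens in the same document where the second one comes at or
    after the first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dt_matrix_token ("
        " dt_matrix_token_id INTEGER PRIMARY KEY,"
        " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
    )
    conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", rows)

    # One "?" placeholder per token id, e.g. "?,?" for two tokens.
    marks = ",".join("?" * len(token_ids))
    sql = f"""
        SELECT mt1.token_id, mt1.position, mt2.token_id, mt2.position
        FROM dt_matrix_token AS mt1
        JOIN dt_matrix_token AS mt2
          ON mt2.storage_data_id = mt1.storage_data_id
         AND mt2.token_id <> mt1.token_id
         AND mt2.position >= mt1.position
        WHERE mt1.token_id IN ({marks})
          AND mt2.token_id IN ({marks})
        ORDER BY mt1.position, mt2.position
    """
    # The token list is bound twice: once for mt1, once for mt2.
    result = conn.execute(sql, list(token_ids) * 2).fetchall()
    conn.close()
    return result


pairs = ordered_pairs(SAMPLE_ROWS, [1, 2])
print(pairs)
```

With these rows, token 1 at position 3 pairs with token 2 at position 5, and token 2 at position 5 pairs with the later token 1 at position 30. The two occurrences of token 1 never pair with each other, because the token_ids must differ.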

Now if you group by the dt_matrix_token_id from the first table, combined with the token_id from the second table, you narrow down that set of results to one of each token_id from the second table for every token in the first.

And when grouping the results from the second table, it's the minimum position you care about. Since the second token always follows the first, this gives you the position that is nearest to the first token.

SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
  mt1.token_id AS token_id1, mt2.token_id AS token_id2,
  mt1.position AS position1, MIN(mt2.position) AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
WHERE mt1.token_id IN (1,2,3) 
  AND mt2.token_id IN (1,2,3)
  AND mt2.token_id <> mt1.token_id
  AND mt2.storage_data_id = mt1.storage_data_id
  AND mt2.position >= mt1.position 
GROUP BY mt1.dt_matrix_token_id, mt2.token_id

So now, for every instance of a token you care about, you have the nearest position to any of the tokens that follow it in the same document.

But what you really want is the maximum distance from the first token to any of the tokens that follow it. So you need to group by the dt_matrix_token_id again, and calculate the distance to the maximum of the second positions (i.e. the maximum of the minimums for each token_id).

SELECT dt_matrix_token_id, storage_data_id,
  MAX(position2)-position1 AS distance
FROM (
  SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
    mt1.position AS position1, MIN(mt2.position) AS position2
  FROM dt_matrix_token AS mt1
  JOIN dt_matrix_token AS mt2
  WHERE mt1.token_id IN (1,2,3) 
    AND mt2.token_id IN (1,2,3)
    AND mt2.token_id <> mt1.token_id
    AND mt2.storage_data_id = mt1.storage_data_id
    AND mt2.position >= mt1.position 
  GROUP BY mt1.dt_matrix_token_id, mt2.token_id
) AS temp
GROUP BY dt_matrix_token_id

However, not every token from the first table will have been followed by all of the other tokens you care about. So you need to make sure the COUNT of the results in each group is equal to the number of tokens you care about minus one (1 token in the first table, and n-1 tokens in the second).

You can do this by adding a HAVING clause to the query that groups by dt_matrix_token_id: HAVING COUNT(*) = 3-1, where the 3 in that expression is the number of tokens you are searching for.

Now for every instance of a token you care about, that is followed by all of the other tokens you care about (in the same document), you have the shortest distance that covers all of them.
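As a rough check of this step, the sketch below runs the grouped query with the HAVING clause from Python against an in-memory SQLite database. SQLite stands in for MySQL, the sample rows are invented, and `covering_distances` is a hypothetical helper name. The GROUP BY lists also name storage_data_id and position, which does not change the groups (both depend on dt_matrix_token_id) but should keep the query acceptable under stricter GROUP BY settings.

```python
import sqlite3

# Hypothetical sample rows:
# (dt_matrix_token_id, token_id, storage_data_id, position)
SAMPLE_ROWS = [
    (1, 1, 1, 3),
    (2, 2, 1, 5),
    (3, 3, 1, 9),
    (4, 1, 1, 30),
]


def covering_distances(rows, token_ids):
    """For each token occurrence that is followed by all the other tokens,
    return (dt_matrix_token_id, storage_data_id, distance), where distance
    spans from that occurrence to the last of the nearest followers."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dt_matrix_token ("
        " dt_matrix_token_id INTEGER PRIMARY KEY,"
        " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
    )
    conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", rows)

    marks = ",".join("?" * len(token_ids))
    sql = f"""
        SELECT dt_matrix_token_id, storage_data_id,
               MAX(position2) - position1 AS distance
        FROM (
            -- nearest following position of each *other* token
            SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
                   mt1.position AS position1,
                   MIN(mt2.position) AS position2
            FROM dt_matrix_token AS mt1
            JOIN dt_matrix_token AS mt2
              ON mt2.storage_data_id = mt1.storage_data_id
             AND mt2.token_id <> mt1.token_id
             AND mt2.position >= mt1.position
            WHERE mt1.token_id IN ({marks})
              AND mt2.token_id IN ({marks})
            GROUP BY mt1.dt_matrix_token_id, mt1.storage_data_id,
                     mt1.position, mt2.token_id
        ) AS nearest
        GROUP BY dt_matrix_token_id, storage_data_id, position1
        HAVING COUNT(*) = ?          -- n - 1 other tokens must follow
        ORDER BY dt_matrix_token_id
    """
    params = list(token_ids) * 2 + [len(token_ids) - 1]
    result = conn.execute(sql, params).fetchall()
    conn.close()
    return result


distances = covering_distances(SAMPLE_ROWS, [1, 2, 3])
print(distances)
```

In this sample, the token at position 9 is dropped by the HAVING clause because token 2 never follows it, and the token at position 30 is dropped by the join because nothing follows it at all.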

But there will quite likely be multiple results for each document, and you really only need to know the shortest in each case. So now you need to group by the storage_data_id and calculate the minimum distance in the group.

SELECT storage_data_id, MIN(distance) AS distance
FROM (
  SELECT dt_matrix_token_id, storage_data_id,
    MAX(position2)-position1 AS distance
  FROM (
    SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
      mt1.position AS position1, MIN(mt2.position) AS position2
    FROM dt_matrix_token AS mt1
    JOIN dt_matrix_token AS mt2
    WHERE mt1.token_id IN (1,2,3) 
      AND mt2.token_id IN (1,2,3)
      AND mt2.token_id <> mt1.token_id
      AND mt2.storage_data_id = mt1.storage_data_id
      AND mt2.position >= mt1.position 
    GROUP BY mt1.dt_matrix_token_id, mt2.token_id
  ) AS temp
  GROUP BY dt_matrix_token_id
  HAVING COUNT(*) = 3-1
) AS temp
GROUP BY storage_data_id

This gives you each document that contains all the tokens you care about, and the minimum distance that covers all of those tokens. To limit the results to distances in a specific range, you can just add another HAVING clause to the outermost query, after GROUP BY storage_data_id.

HAVING distance <= 20

Then the number of results from that query should tell you how many documents contain all the tokens you care about within the specified range.
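The whole approach can be tried end to end with a small script. This is only a sketch under a few assumptions: it uses an in-memory SQLite database instead of MySQL, the sample rows are invented, and `documents_within` is a hypothetical helper. It binds the token list and the maximum distance as parameters, and writes the final filter as HAVING MIN(distance) so that it does not depend on how the database resolves a column alias in HAVING.

```python
import sqlite3

# Hypothetical sample rows:
# (dt_matrix_token_id, token_id, storage_data_id, position)
SAMPLE_ROWS = [
    (1, 1, 1, 3), (2, 2, 1, 5), (3, 3, 1, 9), (4, 1, 1, 30),   # document 1
    (5, 1, 2, 1), (6, 1, 2, 4), (7, 2, 2, 50),                 # document 2
    (8, 3, 3, 10), (9, 2, 3, 40), (10, 1, 3, 45),              # document 3
]


def documents_within(rows, token_ids, max_distance):
    """Return (storage_data_id, min_distance) for every document that
    contains all token_ids within max_distance positions."""
    # Duplicates in the list would break the COUNT(*) = n - 1 check.
    token_ids = sorted(set(token_ids))

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dt_matrix_token ("
        " dt_matrix_token_id INTEGER PRIMARY KEY,"
        " token_id INTEGER, storage_data_id INTEGER, position INTEGER)"
    )
    conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", rows)

    marks = ",".join("?" * len(token_ids))
    sql = f"""
        SELECT storage_data_id, MIN(distance) AS min_distance
        FROM (
            SELECT dt_matrix_token_id, storage_data_id,
                   MAX(position2) - position1 AS distance
            FROM (
                SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
                       mt1.position AS position1,
                       MIN(mt2.position) AS position2
                FROM dt_matrix_token AS mt1
                JOIN dt_matrix_token AS mt2
                  ON mt2.storage_data_id = mt1.storage_data_id
                 AND mt2.token_id <> mt1.token_id
                 AND mt2.position >= mt1.position
                WHERE mt1.token_id IN ({marks})
                  AND mt2.token_id IN ({marks})
                GROUP BY mt1.dt_matrix_token_id, mt1.storage_data_id,
                         mt1.position, mt2.token_id
            ) AS nearest
            GROUP BY dt_matrix_token_id, storage_data_id, position1
            HAVING COUNT(*) = ?               -- all n - 1 other tokens follow
        ) AS per_start
        GROUP BY storage_data_id
        HAVING MIN(distance) <= ?             -- keep only close matches
        ORDER BY storage_data_id
    """
    params = token_ids * 2 + [len(token_ids) - 1, max_distance]
    result = conn.execute(sql, params).fetchall()
    conn.close()
    return result


print(documents_within(SAMPLE_ROWS, [1, 2, 3], 20))
print(documents_within(SAMPLE_ROWS, [1, 2], 10))
```

Document 2 in the sample never matches a three-token search because it has no token 3, even though token 1 occurs twice there, which is the false positive the question was worried about.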

answered 2013-08-03T23:55:57.153