I would start by joining the dt_matrix_token table to a second instance of itself, where both instances have a token_id in the set of values you are interested in, but the two token_ids must differ.
The two rows should also have a matching storage_data_id (i.e. they're in the same document), and the position of the second token must be greater than or equal to the position of the first.
SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
       mt1.token_id AS token_id1, mt2.token_id AS token_id2,
       mt1.position AS position1, mt2.position AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
  ON mt2.storage_data_id = mt1.storage_data_id
 AND mt2.position >= mt1.position
 AND mt2.token_id <> mt1.token_id
WHERE mt1.token_id IN (1,2,3)
  AND mt2.token_id IN (1,2,3)
This gives you every sequential pair of tokens that you care about.
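To make the shape of that result concrete, here is a minimal sketch that runs the pair query against a tiny in-memory table. It uses Python's sqlite3 module as a stand-in for MySQL, and the four sample rows are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical miniature of dt_matrix_token: one document (storage_data_id 100)
# holding tokens 1, 9, 2, 3 at positions 1..4. Token 9 is noise we do not care about.
ROWS = [
    # (dt_matrix_token_id, storage_data_id, token_id, position)
    (1, 100, 1, 1),
    (2, 100, 9, 2),
    (3, 100, 2, 3),
    (4, 100, 3, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY,"
    " storage_data_id INTEGER, token_id INTEGER, position INTEGER)"
)
conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", ROWS)

# Self-join: same document, second token at or after the first, different token_id.
PAIR_QUERY = """
SELECT mt1.token_id AS token_id1, mt2.token_id AS token_id2,
       mt1.position AS position1, mt2.position AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
  ON mt2.storage_data_id = mt1.storage_data_id
 AND mt2.position >= mt1.position
 AND mt2.token_id <> mt1.token_id
WHERE mt1.token_id IN (1, 2, 3)
  AND mt2.token_id IN (1, 2, 3)
ORDER BY position1, position2
"""

pairs = conn.execute(PAIR_QUERY).fetchall()
for row in pairs:
    print(row)
```

On this sample the token at position 1 pairs with both later tokens, the token at position 3 pairs with the one at position 4, and the last token has nothing after it, so it never appears as the first half of a pair.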
Now if you group by the dt_matrix_token_id from the first table, combined with the token_id from the second table, you narrow down that set of results to one of each token_id from the second table for every token in the first.
And when grouping the results from the second table, it's the minimum position you care about. Since the second token always follows the first, this gives you the position that is nearest to the first token.
SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
       mt1.token_id AS token_id1, mt2.token_id AS token_id2,
       mt1.position AS position1, MIN(mt2.position) AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
  ON mt2.storage_data_id = mt1.storage_data_id
 AND mt2.position >= mt1.position
 AND mt2.token_id <> mt1.token_id
WHERE mt1.token_id IN (1,2,3)
  AND mt2.token_id IN (1,2,3)
GROUP BY mt1.dt_matrix_token_id, mt2.token_id
So now, for every instance of a token you care about, you have the nearest position of each of the other tokens that follow it in the same document.
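The effect of MIN is easiest to see when a token repeats. The sketch below again uses sqlite3 as a stand-in for MySQL, with invented sample rows in which token 2 occurs twice after token 1; the grouped query should keep only the nearer occurrence.

```python
import sqlite3

# Hypothetical sample: document 100 contains tokens 1, 2, 3, 2 at positions 1..4.
ROWS = [
    # (dt_matrix_token_id, storage_data_id, token_id, position)
    (1, 100, 1, 1),
    (2, 100, 2, 2),
    (3, 100, 3, 3),
    (4, 100, 2, 4),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY,"
    " storage_data_id INTEGER, token_id INTEGER, position INTEGER)"
)
conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", ROWS)

# One row per (first token instance, following token_id), keeping the nearest position.
NEAREST_QUERY = """
SELECT mt1.dt_matrix_token_id AS dt_matrix_token_id,
       mt2.token_id AS token_id2,
       MIN(mt2.position) AS position2
FROM dt_matrix_token AS mt1
JOIN dt_matrix_token AS mt2
  ON mt2.storage_data_id = mt1.storage_data_id
 AND mt2.position >= mt1.position
 AND mt2.token_id <> mt1.token_id
WHERE mt1.token_id IN (1, 2, 3)
  AND mt2.token_id IN (1, 2, 3)
GROUP BY mt1.dt_matrix_token_id, mt2.token_id
ORDER BY mt1.dt_matrix_token_id, mt2.token_id
"""

nearest = conn.execute(NEAREST_QUERY).fetchall()
for row in nearest:
    print(row)
```

For the first row of the table (token 1 at position 1), token 2 is reported at position 2 rather than 4, which is the "nearest following occurrence" behaviour described above.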
But what you really want is the maximum distance from the first token to any of the tokens that follow it. So you need to group by the dt_matrix_token_id again, and calculate the distance to the maximum of the second positions (i.e. the maximum of the minimums for each token_id).
SELECT dt_matrix_token_id, storage_data_id,
       MAX(position2)-position1 AS distance
FROM (
    SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
           mt1.position AS position1, MIN(mt2.position) AS position2
    FROM dt_matrix_token AS mt1
    JOIN dt_matrix_token AS mt2
      ON mt2.storage_data_id = mt1.storage_data_id
     AND mt2.position >= mt1.position
     AND mt2.token_id <> mt1.token_id
    WHERE mt1.token_id IN (1,2,3)
      AND mt2.token_id IN (1,2,3)
    GROUP BY mt1.dt_matrix_token_id, mt2.token_id
) AS temp
-- storage_data_id and position1 are constant per dt_matrix_token_id;
-- listing them keeps the query valid under ONLY_FULL_GROUP_BY
GROUP BY dt_matrix_token_id, storage_data_id, position1
However, not every token from the first table will have been followed by all of the other tokens you care about. So you need to make sure the COUNT of the results in each group is equal to the number of tokens you care about minus one (1 token in the first table, and n-1 tokens in the second). You can do this with a HAVING clause placed after the outer GROUP BY, HAVING COUNT(*) = 3-1, where the 3 in that expression represents the number of tokens you are searching for.
Now for every instance of a token you care about, that is followed by all of the other tokens you care about (in the same document), you have the shortest distance that covers all of them.
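As a rough check of the COUNT filter, the following sketch computes the per-instance distance on invented sample data, using sqlite3 as a stand-in for MySQL. The rows are hypothetical; the point is that instances not followed by both other tokens drop out of the result.

```python
import sqlite3

# Hypothetical sample: document 100 contains tokens 1, 2, 3, 2, 1 at positions 1..5.
ROWS = [
    # (dt_matrix_token_id, storage_data_id, token_id, position)
    (1, 100, 1, 1),
    (2, 100, 2, 2),
    (3, 100, 3, 3),
    (4, 100, 2, 4),
    (5, 100, 1, 5),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY,"
    " storage_data_id INTEGER, token_id INTEGER, position INTEGER)"
)
conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", ROWS)

# Outer query: farthest of the nearest positions, kept only when all
# n-1 other tokens (here 3-1 = 2) follow the first token instance.
DISTANCE_QUERY = """
SELECT dt_matrix_token_id, MAX(position2) - position1 AS distance
FROM (
    SELECT mt1.dt_matrix_token_id AS dt_matrix_token_id,
           mt1.storage_data_id AS storage_data_id,
           mt1.position AS position1,
           MIN(mt2.position) AS position2
    FROM dt_matrix_token AS mt1
    JOIN dt_matrix_token AS mt2
      ON mt2.storage_data_id = mt1.storage_data_id
     AND mt2.position >= mt1.position
     AND mt2.token_id <> mt1.token_id
    WHERE mt1.token_id IN (1, 2, 3)
      AND mt2.token_id IN (1, 2, 3)
    GROUP BY mt1.dt_matrix_token_id, mt2.token_id
) AS temp
GROUP BY dt_matrix_token_id, storage_data_id, position1
HAVING COUNT(*) = 3 - 1
ORDER BY dt_matrix_token_id
"""

distances = conn.execute(DISTANCE_QUERY).fetchall()
for row in distances:
    print(row)
```

Rows 4 and 5 of the sample table are missing from the output: the second token 2 is followed only by token 1, and the final token 1 is followed by nothing, so neither group reaches a count of 2.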
But there will quite likely be multiple results for each document, and you really only need to know the shortest in each case. So now you need to group by the storage_data_id and calculate the minimum distance in the group.
SELECT storage_data_id, MIN(distance) AS distance
FROM (
    SELECT dt_matrix_token_id, storage_data_id,
           MAX(position2)-position1 AS distance
    FROM (
        SELECT mt1.dt_matrix_token_id, mt1.storage_data_id,
               mt1.position AS position1, MIN(mt2.position) AS position2
        FROM dt_matrix_token AS mt1
        JOIN dt_matrix_token AS mt2
          ON mt2.storage_data_id = mt1.storage_data_id
         AND mt2.position >= mt1.position
         AND mt2.token_id <> mt1.token_id
        WHERE mt1.token_id IN (1,2,3)
          AND mt2.token_id IN (1,2,3)
        GROUP BY mt1.dt_matrix_token_id, mt2.token_id
    ) AS temp
    -- storage_data_id and position1 are constant per dt_matrix_token_id
    GROUP BY dt_matrix_token_id, storage_data_id, position1
    HAVING COUNT(*) = 3-1
) AS temp2
GROUP BY storage_data_id
This gives you each document that contains all the tokens you care about, along with the minimum distance that covers all of those tokens. To limit the results to distances in a specific range, you can just add another HAVING clause after the final GROUP BY.
HAVING distance <= 20
Then the number of results from that query should tell you how many documents contain all the tokens you care about within the specified range.
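Putting the steps together, here is a sketch of how the whole query might be parameterised so that the IN lists and the n-1 count always stay in sync. The helper name documents_within and the sample rows are hypothetical, sqlite3 stands in for MySQL, and the final filter is written as HAVING MIN(distance) rather than relying on the alias, since alias resolution in HAVING differs between databases.

```python
import sqlite3

# Hypothetical sample data covering three documents.
ROWS = [
    # (dt_matrix_token_id, storage_data_id, token_id, position)
    (1, 100, 1, 1), (2, 100, 2, 2), (3, 100, 3, 3), (4, 100, 2, 4), (5, 100, 1, 5),
    (6, 200, 1, 1), (7, 200, 2, 2),                    # token 3 never appears
    (8, 300, 3, 1), (9, 300, 1, 10), (10, 300, 2, 40), # all present, far apart
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dt_matrix_token ("
    " dt_matrix_token_id INTEGER PRIMARY KEY,"
    " storage_data_id INTEGER, token_id INTEGER, position INTEGER)"
)
conn.executemany("INSERT INTO dt_matrix_token VALUES (?, ?, ?, ?)", ROWS)


def documents_within(conn, tokens, max_distance):
    """Return (storage_data_id, min_distance) for documents containing all
    tokens within max_distance positions. Assumes at least two distinct tokens."""
    tokens = sorted(set(tokens))
    marks = ", ".join("?" for _ in tokens)  # one placeholder per token
    query = f"""
    SELECT storage_data_id, MIN(distance) AS min_distance
    FROM (
        SELECT dt_matrix_token_id, storage_data_id,
               MAX(position2) - position1 AS distance
        FROM (
            SELECT mt1.dt_matrix_token_id AS dt_matrix_token_id,
                   mt1.storage_data_id AS storage_data_id,
                   mt1.position AS position1,
                   MIN(mt2.position) AS position2
            FROM dt_matrix_token AS mt1
            JOIN dt_matrix_token AS mt2
              ON mt2.storage_data_id = mt1.storage_data_id
             AND mt2.position >= mt1.position
             AND mt2.token_id <> mt1.token_id
            WHERE mt1.token_id IN ({marks})
              AND mt2.token_id IN ({marks})
            GROUP BY mt1.dt_matrix_token_id, mt2.token_id
        ) AS temp
        GROUP BY dt_matrix_token_id, storage_data_id, position1
        HAVING COUNT(*) = ?
    ) AS temp2
    GROUP BY storage_data_id
    HAVING MIN(distance) <= ?
    ORDER BY storage_data_id
    """
    # Parameters follow placeholder order: both IN lists, n-1, then the limit.
    params = [*tokens, *tokens, len(tokens) - 1, max_distance]
    return conn.execute(query, params).fetchall()


near = documents_within(conn, [1, 2, 3], 20)
wide = documents_within(conn, [1, 2, 3], 50)
print(near, len(near))
print(wide, len(wide))
```

The length of the returned list is the document count the question asks for. Document 200 never qualifies because it lacks one of the tokens, and document 300 only qualifies once the allowed distance is widened enough to cover its 39-position span.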