I'm searching several million names and addresses in a Postgres table. I'd like to use pg_trgm to do fast fuzzy search.

My app is actually very similar to the one in "Optimizing a postgres similarity query (pg_trgm + gin index)", and the answer there is pretty good.

My problem is that the relevance ranking isn't very good. There are two issues:

  1. I want names to get a heavier weight in the ranking than addresses, and it's not clear how to do that and still get good performance. For example, if a user searches for 'smith', I want 'Bob Smith' to appear higher in the results than '123 Smith Street'.

  2. The current results are biased toward columns that contain fewer characters. For example, a search for 'bob' will rank 'Bobby Smith' (without an address) above 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here'. The reason is that the similarity score penalizes the parts of the string that do not match the search term.
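The length bias is easy to reproduce directly with `similarity()` (the strings here are just the examples from above):

```sql
-- Both strings contain 'bob', but the longer one scores lower
-- because similarity() divides the shared trigrams by the size
-- of the combined trigram set, so extra text dilutes the score.
SELECT similarity('bob', 'Bobby Smith');
SELECT similarity('bob', 'Bob Smith, 123 Bob Street, Smithville Illinois, 12345 with some other info here');
-- The second score comes out much lower, even though it arguably
-- matches the query at least as well.
```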

I'm thinking I'd get much better results if I could get a score that simply counts the matched trigrams in a record, rather than the number of trigrams scaled by the length of the target string. That's the way most search engines (like Elasticsearch) work: they rank by the weighted number of hits and do not penalize long documents.
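For what it's worth, a raw match count can be computed with `show_trgm()`, though as written this is a sequential scan and I don't see how to make it use the trigram index (the `people` table and its columns are hypothetical stand-ins for mine):

```sql
-- Count trigrams shared between the query and each row, with no
-- normalization by string length. This is the score I want, but
-- nothing here is indexable.
SELECT id, name, address,
       ( SELECT count(*)
         FROM unnest(show_trgm('smith')) AS q(t)
         WHERE t = ANY (show_trgm(name || ' ' || address)) ) AS hits
FROM people
ORDER BY hits DESC
LIMIT 20;
```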

Is it possible to do this with pg_trgm AND get good (sub-second) performance? I could compute an arbitrary ranking of the results, but if the ORDER BY clause does not match the index, performance will be poor.
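The best workaround I can think of is to use the GIN index only for candidate selection via the indexable `%` operator, then apply an arbitrary weighted ranking to the surviving rows. Something like the sketch below (the table, columns, and the 2.0 weight are made up), but I'm not sure it stays fast when the filter matches many rows:

```sql
-- The WHERE clause can use the GIN trigram index; the custom
-- ORDER BY expression is then evaluated only on the candidate
-- set. Names get double the weight of addresses (issue 1).
SELECT id, name, address,
       2.0 * similarity(name, 'smith')
           + similarity(address, 'smith') AS rank
FROM people
WHERE name % 'smith' OR address % 'smith'
ORDER BY rank DESC
LIMIT 20;
```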
