I have an array of phrases (anywhere from a few to several hundred).

Example:

adhesive materials
adhesive material
material adhesive
adhesive applicator
adhesive applicators
adhesive applications
adhesive application
adhesives applications
adhesive application systems
adhesive application system

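Programmatically, using PHP, I'd like to use something like stemming to reduce the list above to the one below (some variation is acceptable; for example, adhesive applicator and adhesive application may be hard to tell apart because the stem is the same):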

adhesive material
material adhesive
adhesive applicator
adhesive application
adhesive application system

What is the best way to do this?

1 Answer

You'd decide on a threshold (the largest edit distance you still treat as "the same word") and then use PHP's levenshtein() function to check how close two words are.
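To get a feel for where that threshold should sit, it can help to look at the distance between a few word pairs from the question. This is only a sketch: `isCloseEnough()` is a hypothetical helper, and the cut-off of 3 is an assumption you would tune for your own data.

```php
<?php
// Hypothetical helper: true when two words are fewer than $threshold edits apart.
function isCloseEnough( $a, $b, $threshold = 3 )
{
    return levenshtein( $a, $b ) < $threshold;
}

// Word pairs taken from the example list in the question.
$pairs = array(
    array( 'materials', 'material' ),
    array( 'applicator', 'application' ),
    array( 'adhesive', 'material' ),
);

foreach( $pairs as $pair )
{
    $distance = levenshtein( $pair[ 0 ], $pair[ 1 ] );
    $verdict = isCloseEnough( $pair[ 0 ], $pair[ 1 ] ) ? 'same' : 'different';
    echo $pair[ 0 ], ' / ', $pair[ 1 ], ': ', $distance, ' (', $verdict, ")\n";
}
```

With a cut-off of 3, "materials" and "material" (distance 1) collapse into one word, but so do "applicator" and "application" (distance 2), which is the kind of variation the question says is acceptable.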

It looks like you'd more or less be doing this:

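$threshold = 3; // words fewer than this many edits apart count as the same
$origs = array();
// assuming your example is already an array of phrases in $setList
foreach( $setList as $set )
{
    $pieces = explode( ' ', $set );
    $add = true;
    foreach( $origs as $keySet )
    {
        // phrases with a different number of words are never duplicates
        if( count( $pieces ) !== count( $keySet ) ) continue;

        // compare word by word, in order; every word has to be close
        $same = true;
        foreach( $pieces as $i => $word )
        {
            if( levenshtein( $word, $keySet[ $i ] ) >= $threshold )
            {
                $same = false;
                break;
            }
        }

        if( $same )
        {
            $add = false;
            break;
        }
    }

    if( $add ) $origs[] = $pieces;
}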

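You'll be left with a list similar to your output, stored in $origs as arrays of words (use implode( ' ', ... ) to turn each one back into a phrase). Since the first variant seen is the one that gets kept, some modifications will be needed if you'd prefer the shortest variant to be in the list, but you get the idea.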
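If keeping the shortest variant matters, one possible modification is to replace the stored phrase whenever a shorter match turns up. The following is a minimal sketch under that assumption, not a definitive implementation: `phrasesMatch()` and `reducePhrases()` are hypothetical names, and the word-by-word comparison with a threshold of 3 is an assumed matching rule.

```php
<?php
// Hypothetical helper: true when both phrases have the same number of words
// and each pair of words is fewer than $threshold edits apart.
function phrasesMatch( array $a, array $b, $threshold )
{
    if( count( $a ) !== count( $b ) ) return false;
    foreach( $a as $i => $word )
    {
        if( levenshtein( $word, $b[ $i ] ) >= $threshold ) return false;
    }
    return true;
}

// Hypothetical helper: reduce the list, keeping the shortest phrase of each group.
function reducePhrases( array $setList, $threshold = 3 )
{
    $origs = array();
    foreach( $setList as $set )
    {
        $pieces = explode( ' ', $set );
        $found = false;
        foreach( $origs as $key => $keySet )
        {
            if( phrasesMatch( $pieces, $keySet, $threshold ) )
            {
                // Same group: keep whichever spelling is shorter.
                if( strlen( $set ) < strlen( implode( ' ', $keySet ) ) )
                {
                    $origs[ $key ] = $pieces;
                }
                $found = true;
                break;
            }
        }
        if( !$found ) $origs[] = $pieces;
    }

    // Turn each array of words back into a phrase.
    return array_map( function( $words ) { return implode( ' ', $words ); }, $origs );
}
```

Calling `reducePhrases( $setList )` on the example list keeps "adhesive material" rather than "adhesive materials", and "adhesive application system" rather than "adhesive application systems". Note that replacing a stored phrase changes what later phrases are compared against, so the grouping can depend on input order.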

answered 2011-08-15T03:51:41.977