I'm working on something similar to Solr's WordDelimiterFilter, but not in Java.
I want to split words into tokens like this:
P90X = P, 90, X (split on letter/number boundary)
TotallyCromulentWord = Totally, Cromulent, Word (split on lowercase/uppercase boundary)
TransAM = Trans, AM
I'm looking for a general solution, not one specific to the above examples. Preferably it should work in a regex flavour that doesn't support lookbehind, but I can use PL/Perl if necessary, which does support lookbehind.
I found a few answers on SO, but they all seem to use lookbehind.
Things to split on:
- Transition from lowercase letter to uppercase letter
- Transition from letter to number or number to letter
- (Optional) split on a few other characters (such as - and _)
My main concern is the first two transitions.
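
For reference, the rules above can also be expressed by matching tokens instead of splitting at boundaries, which avoids lookbehind (and lookahead) entirely. This is only a minimal sketch in Python, assuming ASCII letters and digits; the names `TOKEN` and `split_word` are made up for illustration, and the same pattern should carry over to any flavour that can return all matches.

```python
import re

# ASCII-only token pattern (an assumption; widen the classes for other alphabets).
# Alternatives are tried left to right:
#   [0-9]+         a run of digits
#   [A-Z]*[a-z]+   an optional uppercase run followed by lowercase letters
#   [A-Z]+         an uppercase run with no lowercase letters after it
# Anything else (-, _, spaces, ...) matches nothing and so acts as a delimiter.
TOKEN = re.compile(r"[0-9]+|[A-Z]*[a-z]+|[A-Z]+")

def split_word(word):
    """Return the tokens of `word`, split at lower->upper and letter<->digit boundaries."""
    return TOKEN.findall(word)

print(split_word("P90X"))
print(split_word("TotallyCromulentWord"))
print(split_word("TransAM"))
```

Because only the lowercase-to-uppercase transition counts as a boundary, an input such as "HTMLParser" stays a single token under this sketch; splitting it into "HTML" and "Parser" would need a different rule.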