This is best described with an example. Given the paragraph:
The longest string in this paragraph is not the shortest string in the paragraph because it is the longest string in the paragraph
I want to list the repeated substrings ordered first by frequency and then by length, so in this case it should list (case-insensitively):
The longest string in
the paragraph
is not the shortest string in
because
it is
this
The above lists the substrings in order of how frequently they occur, followed by length, so "The longest string in" is repeated twice and is the longest substring. "is not the shortest string in" is longer than "the paragraph", but "the paragraph" is repeated twice, so it is listed first.
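To make the ordering concrete, here is a rough sketch of the "frequency first, then length" sort. It assumes substrings are runs of whole words (split on whitespace) and that comparison is case-insensitive; the names `ngram_counts` and `rank` are made up for illustration, and this only ranks candidates, it does not yet decide which ones to drop.

```python
from collections import Counter

PARAGRAPH = (
    "The longest string in this paragraph is not the shortest string in "
    "the paragraph because it is the longest string in the paragraph"
)

def ngram_counts(text):
    """Count every contiguous run of words, ignoring case.

    Assumption: a "substring" is a run of whole words, not arbitrary characters.
    """
    words = text.lower().split()
    counts = Counter()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            counts[" ".join(words[start:end])] += 1
    return counts

def rank(counts):
    """Order phrases by frequency (descending), then by length (descending)."""
    return sorted(counts, key=lambda phrase: (-counts[phrase], -len(phrase)))

counts = ngram_counts(PARAGRAPH)
ordered = rank(counts)
```

With this raw ordering, "the paragraph" (2 occurrences) does come before "is not the shortest string in" (1 occurrence), but very short, very frequent phrases such as "the" come before both, so some extra filtering is still needed to get the list shown above.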
Update (based on observations by AlexC and MattBurland):
Even if a substring such as the space character or "in" occurs more often than other substrings, it should not be listed if it is already included in a longer substring whose length exceeds the short substring's occurrences * length. For example, "in" occurs 3 times, which is 6 characters in total (9 including the trailing space each time), but since 9 characters is shorter than "the paragraph", it is not listed. I hope this makes sense?
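One possible reading of this rule, written out as a small check. It is only a sketch under assumptions: "included in" is taken to mean plain substring containment, each occurrence is padded by one trailing space (which gives the 9 in the example), and `is_suppressed` and its parameters are hypothetical names.

```python
def is_suppressed(phrase, occurrences, longer_phrases, pad=1):
    """Return True if `phrase` should be hidden by a longer phrase containing it.

    Assumption: the weight of a phrase is occurrences * (length + pad), where
    pad counts one trailing space per occurrence.
    """
    weight = occurrences * (len(phrase) + pad)
    return any(
        phrase in longer and len(longer) > weight
        for longer in longer_phrases
    )

# "in" occurs 3 times: 3 * (2 + 1) = 9, and it sits inside a 21-character phrase.
hidden = is_suppressed("in", 3, ["the longest string in"])
```

Here `hidden` is `True`, because 9 is less than the 21 characters of "the longest string in", which contains "in". A phrase that is not contained in any longer listed phrase is never suppressed by this check.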