Rather than asking a binary is-it-a-word question, you might instead consider the probability of a word being gibberish. You can then choose a threshold that suits you.

For computing word probabilities, you might try the Web Language Model API, for example its joint-probability operation. For your set of words, the response looks as follows (values for the `body` corpus):
```json
{
  "results": [
    { "words": "sdf#%#", "probability": -12.215 },
    { "words": "asfsds", "probability": -12.215 },
    { "words": "b",      "probability": -3.127 },
    { "words": "hi",     "probability": -3.905 },
    { "words": "my",     "probability": -2.528 },
    { "words": "name",   "probability": -3.128 },
    { "words": "is",     "probability": -2.201 },
    { "words": "sam.",   "probability": -12.215 },
    { "words": "sam",    "probability": -4.431 }
  ]
}
```
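Once you have log probabilities like these, flagging gibberish reduces to picking a cutoff. A minimal sketch in Python, using the sample values from the response; the threshold of -6.0 is an assumption for illustration, not something the API prescribes:

```python
# Log probabilities as returned for the sample words above.
log_probs = {
    "sdf#%#": -12.215,
    "asfsds": -12.215,
    "b": -3.127,
    "hi": -3.905,
    "my": -2.528,
    "name": -3.128,
    "is": -2.201,
    "sam.": -12.215,
    "sam": -4.431,
}

# Hypothetical cutoff: anything less likely than this is flagged as gibberish.
GIBBERISH_THRESHOLD = -6.0

def is_gibberish(word: str) -> bool:
    """Flag a word as gibberish when its log probability falls below the cutoff."""
    return log_probs.get(word, float("-inf")) < GIBBERISH_THRESHOLD

gibberish = [w for w in log_probs if is_gibberish(w)]
print(gibberish)  # ['sdf#%#', 'asfsds', 'sam.']
```

Where exactly to set the threshold depends on your tolerance for false positives; eyeballing a histogram of scores over a sample of your own data is a reasonable way to choose it.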
You will notice a couple of idiosyncrasies:
- Probabilities are negative. This is because they are logarithmic.
- All terms are case-folded. This means that the corpus won't distinguish between, say, GOAT and goat.
- Callers must perform a certain amount of normalization themselves (note the probability of `sam.` vs. `sam`).
- Corpora are only available for the en-us market. This could be problematic depending on your use case.
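Given the case-folding and punctuation quirks above, it pays to normalize words before scoring them. A minimal sketch; the exact normalization the service expects isn't spelled out here, so stripping surrounding punctuation and lowercasing is an assumption:

```python
import string

def normalize(word: str) -> str:
    """Lowercase and strip surrounding punctuation so 'Sam.' scores like 'sam'."""
    return word.strip(string.punctuation).lower()

print(normalize("Sam."))  # sam
print(normalize("GOAT"))  # goat
```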
An advanced use case would be computing conditional probabilities, i.e. the probability of a word given the words preceding it.
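Even if you only have joint log probabilities, you can derive a conditional one yourself via the chain rule: since P(word | context) = P(context, word) / P(context), the conditional log probability is simply the difference of two joint ones. A sketch; the joint value for "my name" is a made-up illustrative number, not real API output:

```python
# Hypothetical joint log probability of the two-word sequence "my name".
log_p_context_word = -5.0
# Log probability of the context "my" alone (from the response above).
log_p_context = -2.528

# Chain rule in log space:
# log P(word | context) = log P(context, word) - log P(context)
log_p_conditional = log_p_context_word - log_p_context
print(log_p_conditional)  # -2.472
```

The subtraction works regardless of the logarithm's base, since both terms use the same one.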