c# - Removing words with special characters in them

Question

I have a long string composed of a number of different words.

I want to go through all of them, and if a word contains a special character or number (except '-'), or starts with a capital letter, I want to delete it (the whole word, not just that character). For all intents and purposes 'foreign' letters can count as special characters.

The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.

Thanks

ADDED:

(What I want for example:)

Input: "this Is an Example of 5 words in an input like-so from example.com"

Output: {this,an,of,words,in,an,input,like-so,from}

(What I've tried so far)

List<string> response = new List<string>();

string[] splitString = text.Split(' ');

foreach (string s in splitString)
{
    // Keep the word only if every character is '-' or a lowercase letter.
    bool add = true;
    foreach (char c in s)
    {
        if (!(c == '-' || (Char.IsLetter(c) && Char.IsLower(c))))
        {
            add = false;
            break;
        }
    }

    // Decide once per word, after all of its characters have been checked.
    if (add)
    {
        response.Add(s);
    }
}

Edit 2:

For me a word should be a number of characters (a..z) separated by a space. A trailing , . ! or ... shouldn't count for the 'special character' condition (which is really mostly just there to remove URLs or the like).

So: "I saw a dog. It was black!" should result in {saw,a,dog,was,black}

score 2 · Accepted Answer

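So you want to find all "words" that contain only the characters a-z or -, and that are separated by spaces?

A regex like this will find such words:

(?<!\S)[a-z-]+(?!\S)

To also allow words that end with a single punctuation mark, you could use:

(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))

Example (ideone):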

var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";

var m = Regex.Matches(str, re);

Console.WriteLine("Matched: ");
foreach (Match i in m)
    Console.Write(i + " ");

Note the punctuation in the string.

Output:

Matched: 
this an of words in an input like-so from foo bar
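
If you need the words as a list rather than printed output, the same pattern can be wrapped in a small helper. This is only a rough sketch: the class name WordFilter and the method KeepPlainWords are made up for illustration, and the pattern is the second one from this answer, unchanged.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Pattern from the answer above: a run of a-z or '-' that starts after
    // whitespace (or the start of the string) and is followed by at most one
    // punctuation mark and then whitespace (or the end of the string).
    private static readonly Regex PlainWord =
        new Regex(@"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))");

    // Hypothetical helper: returns the matched words in order of appearance.
    public static List<string> KeepPlainWords(string text)
    {
        return PlainWord.Matches(text)
                        .Cast<Match>()
                        .Select(m => m.Value)
                        .ToList();
    }

    public static void Main()
    {
        var words = KeepPlainWords("I saw a dog. It was black!");
        Console.WriteLine(string.Join(",", words)); // saw,a,dog,was,black
    }
}
```

The trailing punctuation sits inside the lookahead, so it is checked but never becomes part of the returned word.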

score 1 · Accepted Answer

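How about this?

(?<=^|\s+)(?<word>[a-z\-]+)(?=$|\s+)

Edit: to allow the trailing punctuation from your second edit, that becomes:

(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))

Rules:

A word can only be preceded by the start of the line or some number of whitespace characters
A word can only be followed by the end of the line or some number of whitespace characters (the edit adds support for words ending in a period, comma, exclamation mark or ellipsis)
A word can only contain lowercase (Latin) letters and dashes

The named group containing each word is "word".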
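
Reading the named group back out might look roughly like this. It is a sketch only: the class NamedGroupDemo and the method ExtractWords are hypothetical names, and the pattern is the edited one above, copied as-is.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The edited pattern from the answer above, unchanged.
    private const string Pattern =
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))";

    // Hypothetical helper: collects the "word" group of every match.
    public static List<string> ExtractWords(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            result.Add(m.Groups["word"].Value);
        }
        return result;
    }

    public static void Main()
    {
        var words = ExtractWords("I saw a dog. It was black!");
        Console.WriteLine(string.Join(",", words)); // saw,a,dog,was,black
    }
}
```

Since the lookbehind and lookahead consume nothing, m.Value and m.Groups["word"].Value are the same string here; the named group just makes the intent explicit.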

score 0 · Accepted Answer

Take a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide). It covers regular expressions in C#.

score 0 · Accepted Answer

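You can do this in two ways: the whitelist way and the blacklist way. With a whitelist you define the set of characters you consider acceptable; with a blacklist you define the opposite.

Let's assume the whitelist way, and that you only accept the characters a-z, A-Z and the - character. In addition, you have the rule that the first character of a word cannot be an uppercase character.

With that, you can do something like this: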

string target = "This is a white-list example: (Foo, bar1)";

var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");

string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", words));

Output:

// is, a, white-list, example
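
For comparison, the blacklist way mentioned at the top of this answer could look something like the following. This is a rough sketch under an assumption that is not part of the answer: "forbidden" is taken to mean any character other than a-z and -, mirroring the question's rules. The names BlacklistDemo and KeepAllowed are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlacklistDemo
{
    // Assumption: anything that is not a-z or '-' is forbidden.
    private static readonly Regex Forbidden = new Regex(@"[^a-z-]");

    // Hypothetical helper: splits on spaces and drops every token that
    // contains at least one forbidden character.
    public static string[] KeepAllowed(string text)
    {
        return text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                   .Where(token => !Forbidden.IsMatch(token))
                   .ToArray();
    }

    public static void Main()
    {
        string target = "This is a white-list example: (Foo, bar1)";
        Console.WriteLine(string.Join(", ", KeepAllowed(target)));
        // is, a, white-list
    }
}
```

Note that "example:" is dropped here because the colon is part of the token, whereas the \b-based whitelist pattern above still picks out "example".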

score 0 · Accepted Answer

List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};

// Iterate backwards so that removing an item does not shift the indices still to be visited.
for (int i = strings.Count - 1; i >= 0; i--)
{
   if (strings[i].Contains("-"))
   {
       strings.RemoveAt(i);
   }
}

score 0 · Accepted Answer

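This might be a starting point. Right now it only checks for "." as a special character. It outputs "this an of words in an like-so from" (with extra spaces left behind where words were removed):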

        string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
        string line = "this Is an Example of 5 words in an in3put like-so from example.com";

        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
        line = r.Replace(line,"");

score 0 · Accepted Answer

You can do this with lookahead and lookbehind. Here is a regex that matches your example:

(?<=\s|^)[a-z-]+(?=\s|$)

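Explanation: match one or more alphabetic characters (lowercase only, plus the hyphen), as long as they are preceded by whitespace (or the start of the string) and followed by whitespace or the end of the string.

All you need to do now is plug it into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.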
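
Plugged in, that might look roughly like this. It is a minimal sketch: the class LookaroundDemo and the method LowercaseWords are hypothetical names, and the input is the example from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical helper around the regex from this answer.
    public static string[] LowercaseWords(string input)
    {
        const string regexString = @"(?<=\s|^)[a-z-]+(?=\s|$)";
        return Regex.Matches(input, regexString)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "this Is an Example of 5 words in an input like-so from example.com";
        Console.WriteLine(string.Join(",", LowercaseWords(input)));
        // this,an,of,words,in,an,input,like-so,from
    }
}
```

This pattern does not allow trailing punctuation, so for "I saw a dog. It was black!" it returns only saw, a and was.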

Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
