1

I have a long string composed of a number of different words.

I want to go through all of them, and if the word contains a special character or number (except '-'), or starts with a Capital letter, I want to delete it (the whole word not just that character). For all intents and purposes 'foreign' letters can count as special characters.

The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.

Thanks

ADDED:

(What I want for example:)

Input: "this Is an Example of 5 words in an input like-so from example.com"

Output: {this,an,of,words,in,an,input,like-so,from}

(What I've tried so far)

List<string> response = new List<string>();

string[] splitString = text.Split(' ');

foreach (string s in splitString)
{
    bool add = true;
    foreach (char c in s.ToCharArray())
    {
         if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
         {
             add = false;
             break;
         }
    }
    // Add the word once, after every character has been checked.
    if (add)
    {
        response.Add(s);
    }
}

Edit 2:

For me, a word should be a number of characters (a..z) separated by a space. A trailing , . ! or ... shouldn't count toward the 'special character' condition (which is really mostly just to remove URLs or the like).

So: "I saw a dog. It was black!" should result in {saw,a,dog,was,black}

4

7 Answers

2

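So you want to find all "words" that contain only the characters a-z and -, and that are separated by whitespace?

A regex like this will find such words: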

(?<!\S)[a-z-]+(?!\S)

To also allow words that end with a single punctuation mark, you can use:

(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))

Example (ideone):

var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";

var m = Regex.Matches(str, re);

Console.WriteLine("Matched: ");
foreach (Match i in m)
    Console.Write(i + " ");

Note the punctuation in the string.

Output:

Matched: 
this an of words in an input like-so from foo bar 
answered 2012-05-24T11:54:55.840
1

How about this?

(?<=^|\s+)(?<word>[a-z\-]+)(?=$|\s+)

Edit: I mean (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))

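Rules:

  1. A word may only be preceded by the start of the line or some amount of whitespace
  2. A word may only be followed by the end of the line or some amount of whitespace (the edit also supports words ending with a period, comma, exclamation mark, or ellipsis)
  3. A word may only contain lowercase (Latin) letters and dashes

The named group that contains each word is "word".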
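
As a minimal sketch of how the named group might be read in C#, assuming the edited pattern above is used unchanged: the class name NamedGroupDemo and the helper ExtractWords are hypothetical names introduced only for illustration, and the input is the "dog" sentence from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Pattern taken from the answer's edit; "word" is the named group it defines.
    private const string Pattern =
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))";

    // Hypothetical helper: returns the value of the "word" group for every match.
    public static List<string> ExtractWords(string text)
    {
        return Regex.Matches(text, Pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups["word"].Value)
                    .ToList();
    }

    public static void Main()
    {
        List<string> words = ExtractWords("I saw a dog. It was black!");
        Console.WriteLine(string.Join(",", words)); // prints: saw,a,dog,was,black
    }
}
```

Because the trailing punctuation sits inside the lookahead, it is checked but never becomes part of the match, so the group value is the bare word.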

answered 2012-05-24T12:01:10.260
0

Take a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide); it covers regular expressions in C#.

answered 2012-05-24T11:42:33.897
0

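You can do this in two ways: the white-list way and the black-list way. With a white-list you define the set of characters you consider acceptable; with a black-list you do the opposite.

Let's assume the white-list way, and that you accept only the characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an upper-case character.

With this you can do something like: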

string target = "This is a white-list example: (Foo, bar1)";

var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");

string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", words));

Output:

// is, a, white-list, example
answered 2012-05-24T11:57:32.337
0
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};

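// Iterate backwards so a removal never shifts an index that is still to be visited.
// The loop has to reach index 0, otherwise the first element is never checked.
for (int i = strings.Count - 1; i >= 0; i--)
{
   if (strings[i].Contains("-"))
   {
       strings.RemoveAt(i);
   }
}
answered 2012-05-24T11:56:35.350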
0

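This could be a starting point. Right now it only checks "." as a special character. It outputs: "this an of words in an like-so from" (with extra spaces left behind wherever a word was removed).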

        string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
        string line = "this Is an Example of 5 words in an in3put like-so from example.com";

        System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
        line = r.Replace(line,"");
answered 2012-05-24T11:56:40.857
0

You can do this with lookahead and lookbehind. Here is a regex that matches your example:

(?<=\s|^)[a-z-]+(?=\s|$)

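Explanation: match one or more alphabetic characters (lowercase only, plus the hyphen), as long as they are preceded by whitespace (or the start of the string) and followed by whitespace or the end of the string.

All you need to do now is plug it into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.

Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet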
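
A rough sketch of that last step, using the question's own sample input: LookaroundDemo and FilterWords are hypothetical names chosen for this illustration, and the pattern is the one given in this answer, so trailing punctuation such as "dog." is not handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Pattern from this answer: lowercase letters and hyphens,
    // bounded by whitespace or the start/end of the string.
    private const string RegexString = @"(?<=\s|^)[a-z-]+(?=\s|$)";

    // Hypothetical helper: collects every match value into a list.
    public static List<string> FilterWords(string input)
    {
        return Regex.Matches(input, RegexString)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    public static void Main()
    {
        string input = "this Is an Example of 5 words in an input like-so from example.com";
        // prints: this,an,of,words,in,an,input,like-so,from
        Console.WriteLine(string.Join(",", FilterWords(input)));
    }
}
```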

answered 2012-05-24T12:03:26.100