
I need to remove words from a string based on a set of words:

The words I want to remove:

DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND

If I receive a string like this:

Edit: this string has already been "cleaned" of any symbols

THIS IS AN AMAZING WEBSITE AND LAYOUT

The result should be:

THIS IS AMAZING WEBSITE LAYOUT

So far, I have:

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });

    string pattern = "";

    foreach (string word in splitWords)
    {
        pattern = @"\b" + word + "\b";
        stringToClean = Regex.Replace(stringToClean, pattern, "");
    }

    return stringToClean;
}

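But this doesn't remove the words. Any idea why?

I also don't know whether this is the most efficient way to do it. Maybe the words should be kept in an array, just to avoid splitting them every time?

Thanks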


7 Answers

private static List<string> wordsToRemove =
    "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();

public static string StringWordsRemove(string stringToClean)
{
    return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}

Modified version that handles punctuation:

public static string StringWordsRemove(string stringToClean)
{
    // Define how to tokenize the input string, i.e. space only or punctuations also
    return string.Join(" ", stringToClean
        .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Except(wordsToRemove));
}
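
One caveat with `Except`: it is a set operation, so besides dropping the listed words it also collapses repeated words in the input (for example "GO GO GO" comes back as "GO"). If duplicates should survive, a `Where` filter over a `HashSet<string>` is one alternative. This is only a sketch: the names `WordFilter` and `RemoveWords` are made up for illustration, and the case-insensitive comparer is an assumption that goes beyond the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter
{
    // HashSet gives constant-time lookups.
    // OrdinalIgnoreCase is an assumption; the original answer compares case-sensitively.
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '),
        StringComparer.OrdinalIgnoreCase);

    public static string RemoveWords(string input)
    {
        // Where keeps every token that is not in the set, including repeated tokens.
        var kept = input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word));

        return string.Join(" ", kept);
    }
}
```

With this sketch, `WordFilter.RemoveWords("THIS IS AN AMAZING WEBSITE AND LAYOUT")` gives "THIS IS AMAZING WEBSITE LAYOUT", and "GO GO AND GO" gives "GO GO GO".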
answered 2013-07-16T14:23:51.020

I just changed this line

pattern = @"\b" + word + "\b";

to this

pattern = @"\b" + word + @"\b"; //added '@' 

and I got the result

THIS IS AMAZING WEBSITE LAYOUT

It would also be better to use String.Empty instead of "", like this:

stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
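
To see why the missing `@` matters: in a regular C# string literal, `"\b"` is the backspace character (U+0008), whereas the verbatim literal `@"\b"` is the two characters backslash and `b`, which the regex engine reads as a word boundary. A minimal sketch of the difference follows; the class name `WordBoundaryDemo` and the `Regex.Escape` call are additions for illustration, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Removes one whole word using a word-boundary pattern built from verbatim literals.
    public static string RemoveWord(string input, string word)
    {
        // Regex.Escape guards against words that contain regex metacharacters.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Replace(input, pattern, String.Empty);
    }

    public static void Main()
    {
        Console.WriteLine("\b".Length);   // 1: a single backspace character
        Console.WriteLine(@"\b".Length);  // 2: a backslash followed by 'b'
        Console.WriteLine((int)"\b"[0]);  // 8: the code point of backspace
        Console.WriteLine(RemoveWord("THIS IS AN AMAZING WEBSITE", "AN"));
    }
}
```

Note that replacing a word with an empty string leaves both of its neighbouring spaces in place, so the last line prints the sentence with two spaces between "IS" and "AMAZING"; collapsing whitespace is a separate step.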
answered 2013-07-16T14:11:56.047

I used LINQ

string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');

string test = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');

string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);

Console.WriteLine(final);
answered 2013-07-16T14:18:54.360

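The output you get is "THIS IS AMAZING WEBSITE LAYOUT".

I ran into a problem where the word "D" was left in the result (so THIS IS AMAZING WEBSITE D LAYOUT), because a plain replace only replaces part of a word. The code below removes the whole word when it matches one of the words you defined (I guess that's what you want?).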

string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string result = text;

foreach (string word in text.Split(' '))
{
    if (tabooWords.Contains(word.ToUpper()))
    {
        // Note: IndexOf finds the first occurrence of the text, which may sit inside a longer word.
        int start = result.IndexOf(word);
        result = result.Remove(start, word.Length);
    }
}
answered 2013-07-16T14:17:43.973
public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
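    // Matches a listed word only when it has a space on both sides,
    // so a word at the very start or end of the string is not removed.
    string pattern = " (" + string.Join("|", splitWords) + ") ";
    string cleaned = Regex.Replace(stringToClean, pattern, " ");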
    return cleaned;
}
answered 2013-07-16T14:18:40.373

How about,

// make a pattern to match all words 
var pattern = string.Format(
    @"\b({0})\b",
    string.Join("|", wordsToRemove.Split(new[] { ' ' })));

// pattern will be of the form "\b(badword1|badword2|...)\b"

// remove all the bad words from the string in one go.    
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");

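The first statement builds a pattern that matches any of the words to remove. The second replaces them all in one pass, which saves unnecessary iterations. The third normalises the whitespace in the string.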
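
Put together as a method, the three steps might look like the sketch below. The `Regex.Escape` call, the final `Trim()`, and the names `StopWordRemover` and `Clean` are my additions rather than part of the snippet above. Building the `Regex` objects once in static fields is also one way to avoid re-splitting the word list on every call, which the question asked about.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordRemover
{
    private const string Words =
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";

    // Built once; has the form \b(DE|DA|...|AND)\b
    private static readonly Regex WordPattern = new Regex(
        @"\b(" + string.Join("|", Words.Split(' ').Select(w => Regex.Escape(w))) + @")\b");

    // Matches any run of whitespace so it can be collapsed to one space.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string Clean(string input)
    {
        // Step 1: remove every listed word in a single pass.
        string withoutWords = WordPattern.Replace(input, string.Empty);

        // Step 2: collapse the gaps left behind, then trim the ends
        // (Trim is an addition; it handles removed words at the start or end).
        return Whitespace.Replace(withoutWords, " ").Trim();
    }
}
```

For the question's input, `StopWordRemover.Clean("THIS IS AN AMAZING WEBSITE AND LAYOUT")` returns "THIS IS AMAZING WEBSITE LAYOUT".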

answered 2013-07-16T14:20:23.807

Or...

stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
stringToClean = Regex.Replace(stringToClean, "  ", " "); // collapse double spaces to one; replacing with String.Empty would join the neighbouring words
answered 2013-07-16T14:27:28.867