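I am collecting data from Twitter and processing it, but the problem I have is that the text is very dirty.
Example:
String dirtyText="this*is#a*&very_dirty&String";
Example:
String dirtyText="All f dis happnd bcause u gave ur time, talent n passion.";
Please, I would like the solution to be as simple as possible.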
public class CleaningDirtText {
    /*
     * Remove leading and trailing spaces, and split the words into a String array.
     * The split method lets you break text apart on a given delimiter. In this
     * case we use the regular expression [\W\d]+, which matches any run of
     * characters that are not word characters, plus digits.
     * Note that the underscore counts as a word character, so "very_dirty"
     * stays a single token.
     */
    private static final String dirtyText = "this is#a*&very_dirty&String";

    public static void main(String[] args) {
        System.out.println(dirtyText);
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
        // Print the cleaned words separated by single spaces.
        for (String clean : words) {
            System.out.print(clean + " ");
        }
    }
}
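Because the split above treats the underscore as a word character, here is a minimal sketch of an alternative that replaces every run of non-letter characters with a single space. The class name SymbolCleaner and the method clean are made up for illustration, and the pattern [^a-z]+ assumes the tweets are plain ASCII English.

```java
import java.util.Locale;

public class SymbolCleaner {
    // Hypothetical helper: lower-cases the text, then turns every run of
    // characters outside a-z (symbols, digits, underscores, spaces) into one space.
    static String clean(String dirty) {
        return dirty.toLowerCase(Locale.ROOT).replaceAll("[^a-z]+", " ").trim();
    }

    public static void main(String[] args) {
        // prints "this is a very dirty string"
        System.out.println(clean("this*is#a*&very_dirty&String"));
    }
}
```

If you need the individual words afterwards, calling split(" ") on the result gives the same kind of array as the answer above.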
public class CleaningDirtText {
    private static final String dirtyText = "this is#a &very_dirty&String";

    public static void main(String[] args) {
        /*
         * Remove leading and trailing spaces, and split the words into a String array.
         * The split method allows you to break apart text on a given delimiter. In this
         * case, we chose to use the regular expression [\W\d]+, which matches any run of
         * characters that are not word characters, plus digits:
         */
        System.out.println(dirtyText);
        String[] words = dirtyText.toLowerCase().trim().split("[\\W\\d]+");
        // First pass: print the tokens back to back, with no separator.
        for (int i = 0; i < words.length; i++) {
            System.out.print(words[i]);
        }
        System.out.println("\nsee the cleaned text:-");
        // Second pass: print the tokens separated by single spaces.
        for (String clean : words) {
            System.out.print(clean + " ");
        }
    }
}
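Neither answer touches the second example from the question, where the problem is shorthand ("u", "ur", "n") rather than symbols. A regular expression cannot guess the intended words, so one simple option is a lookup table. The following is only a sketch: the class SlangExpander, the method expand, and the table entries are assumptions made up for this example, and a real table would have to be built from your own data.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlangExpander {
    // Hypothetical shorthand table; the entries are guesses based on the
    // example sentence in the question, not a complete dictionary.
    private static final Map<String, String> SLANG = new HashMap<>();
    static {
        SLANG.put("f", "of");
        SLANG.put("dis", "this");
        SLANG.put("happnd", "happened");
        SLANG.put("bcause", "because");
        SLANG.put("u", "you");
        SLANG.put("ur", "your");
        SLANG.put("n", "and");
    }

    private static final Pattern WORD = Pattern.compile("[A-Za-z]+");

    // Replaces each word found in the table; everything else
    // (unknown words, punctuation, spacing) is copied through unchanged.
    static String expand(String text) {
        Matcher m = WORD.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group();
            String full = SLANG.getOrDefault(word.toLowerCase(Locale.ROOT), word);
            m.appendReplacement(out, Matcher.quoteReplacement(full));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String dirtyText = "All f dis happnd bcause u gave ur time, talent n passion.";
        // prints "All of this happened because you gave your time, talent and passion."
        System.out.println(expand(dirtyText));
    }
}
```

The lookup is case-insensitive, but the replacement is always inserted in lower case, so a shorthand at the start of a sentence loses its capital letter. Single letters such as "n" or "f" can also be legitimate tokens, so the table needs care.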