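I have a problem I can't solve. Can anyone help me?
OK, so I'm trying to normalize my text with a program that removes repeated spaces, prints the other characters from the original file, and adds spaces as well as start and end symbols.
After the conversion, once the txt file was written and I opened it, I saw something like this: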
numa situaã § ã £ o de emergãªncia mã © dica
As you can see, there are some strange characters that I don't want, maybe because of the encoding? The text is in my language, Portuguese.
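To illustrate the encoding suspicion: the garbled output is consistent with UTF-8 bytes being decoded with a single-byte charset such as ISO-8859-1. The sketch below reproduces that effect for the word "situação". The class and method names are hypothetical, and the choice of ISO-8859-1 is an assumption, since the actual default charset of the machine is not known.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class, not part of the original program.
public class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with the wrong
    // charset (ISO-8859-1 is assumed here), which splits each accented
    // letter into two unrelated characters.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "situa\u00e7\u00e3o" is "situação", written with Unicode escapes so
        // the demo does not depend on the encoding of this source file.
        String garbled = misdecode("situa\u00e7\u00e3o").toLowerCase(Locale.ROOT);
        // Print code points instead of characters so the console encoding
        // cannot distort the result.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Each accented letter becomes a two-character pair: the first is a letter (U+00C3, lowercased to U+00E3) and the second is a symbol such as U+00A7 or U+00A3, which the normalizer then surrounds with spaces.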
Here is my code. How can I fix this?
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.nio.charset.Charset;

public static void main(String[] args) throws IOException {
    // Platform default charset; it may not match the file's actual encoding
    Charset encoding = Charset.defaultCharset();
    InputStream in = new FileInputStream(new File("data.txt"));
    Reader reader = new InputStreamReader(in, encoding);
    Reader buffer = new BufferedReader(reader);
    StringBuilder normalizedLanguage = new StringBuilder("<");
    int r;
    while ((r = buffer.read()) != -1) {
        char ch = (char) r;
        // Per-character state (declared inside the loop, so reset on every iteration)
        boolean newline = false;
        boolean hasLetterBefore = false;
        boolean hasLetterAfter = false;
        char symbol = '-';
        int lines = 0;
        if (newline) {
            normalizedLanguage.append("\n<");
        }
        if (ch == '\r' || ch == '\n') {
            // End of line: close the current line with the end symbol
            lines++;
            normalizedLanguage.append(">");
            newline = true;
            hasLetterBefore = false;
        } else if (Character.isLetterOrDigit(ch)) {
            if (hasLetterBefore == true) {
                normalizedLanguage.append(Character.toString(symbol) + Character.toString(Character.toLowerCase(ch)));
            } else {
                normalizedLanguage.append(Character.toString(Character.toLowerCase(ch)));
            }
            newline = false;
            hasLetterBefore = true;
        } else if (ch == ' ') {
            normalizedLanguage.append(Character.toString(ch));
            newline = false;
            hasLetterBefore = false;
        } else if (ch == '\t') {
            System.out.println("Tab detected: " + ch);
            newline = false;
            hasLetterBefore = false;
        } else {
            // Symbols, among others..
            if (!hasLetterBefore) {
                normalizedLanguage.append(" " + Character.toString(ch) + " ");
            } else {
                symbol = ch;
            }
            newline = false;
        }
    }
    // Collapse runs of spaces into a single space
    String normalizedLanguageString = normalizedLanguage.toString().trim().replaceAll(" +", " ");
    // Written using the platform default charset
    PrintWriter out = new PrintWriter("data_after.txt");
    out.println(normalizedLanguageString);
    out.close();
    buffer.close();
    reader.close();
    in.close();
}
Thank you very much ;)
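One possible direction, sketched under the assumption that data.txt is actually saved as UTF-8: name the charset explicitly for both reading and writing instead of relying on `Charset.defaultCharset()`. The class and method names below are hypothetical, and the per-character work is reduced to lowercasing to keep the example short.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original program.
public class Utf8RoundTrip {
    // Reads `source` as UTF-8 (assumed encoding), lowercases every char,
    // and writes the result to `target` as UTF-8.
    static void lowercaseCopy(Path source, Path target) throws IOException {
        // try-with-resources closes both streams even if an exception is thrown
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int r;
            while ((r = reader.read()) != -1) {
                writer.write(Character.toLowerCase((char) r));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        lowercaseCopy(Paths.get("data.txt"), Paths.get("data_after.txt"));
    }
}
```

If the file turns out not to be valid UTF-8, `Files.newBufferedReader` reports this with a `MalformedInputException` rather than silently producing wrong characters, which at least makes a mismatch visible. The editor used to open data_after.txt also has to interpret the file as UTF-8.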