0

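For my data structures project, the goal is to read a provided file containing over 10,000 songs, with artist, title, and lyrics clearly marked, and each song terminated by a line holding a single double quote. I wrote this code to parse the text file. It works, and in just under 3 seconds it manages to:

  1. read 422K lines of text
  2. create a Song object for each song
  3. add said Song to an ArrayList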

The parsing code I wrote is:

if (songSource.canRead()) {  // checks to see if file is valid to read
    readIn = new Scanner(songSource);
    while (readIn.hasNextLine()) {
        do {
            readToken = readIn.nextLine();

            if (readToken.startsWith("ARTIST=\"")) {
                artist = readToken.split("\"")[1];
            }
            if (readToken.startsWith("TITLE=\"")) {
                title = readToken.split("\"")[1];
            }
            if (readToken.startsWith("LYRICS=\"")) {
                lyrics = readToken.split("\"")[1];
            } else {
                lyrics += "\n" + readToken;
            } // end individual song if block
        } while (!readToken.startsWith("\"")); // end inner do/while loop

        songList.add(new Song(artist, title, lyrics));

    } // end while not EOF
} // end if file can be read

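I was discussing the code for this project with my Intro to Algorithms professor, and he said I should try to be more defensive in my code to allow for inconsistencies in data provided by other people. Originally I was using if/else blocks between the Artist, Title, and Lyrics fields; at his suggestion I changed to sequential if statements. While I can see his point, using this code example, how can I be more defensive about allowing for input inconsistencies?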

4

6 Answers

4

I would replace, for example:

artist= readToken.split("\"")[1];

with:

String[] parts = readToken.split("\"");
if(parts.length >= 2) artist = parts[1];
else continue;

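Other modifications would include:

  1. Resetting the local variables (so that if no artist is provided for a song after the first one, you don't accidentally pick up the wrong artist for that song)
  2. Deciding what to do if some data is missing: do you still want to add the song to the song list?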
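
Putting the length check and the reset together might look roughly like the sketch below. FieldGuard, quotedValue and parse are made-up names for illustration, and a song is reduced to an {artist, title} pair to keep it short; it assumes, as in the question, that a line holding only a double quote ends a record.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; none of these names come from the original code.
class FieldGuard {
    // Returns the text between the first pair of double quotes, or null if there is none.
    static String quotedValue(String line) {
        String[] parts = line.split("\"");
        return parts.length >= 2 ? parts[1] : null;
    }

    // Parses records terminated by a lone '"' line; each result is {artist, title}.
    static List<String[]> parse(List<String> lines) {
        List<String[]> songs = new ArrayList<>();
        String artist = null;
        String title = null;
        for (String line : lines) {
            if (line.startsWith("ARTIST=\"")) {
                artist = quotedValue(line);
            } else if (line.startsWith("TITLE=\"")) {
                title = quotedValue(line);
            } else if (line.equals("\"")) {
                songs.add(new String[] { artist, title });
                // Reset so values cannot leak into the next record.
                artist = null;
                title = null;
            }
        }
        return songs;
    }
}
```

A record with no TITLE line then ends up with a null title instead of the previous song's title, which makes the second question (keep or drop the song?) an explicit decision rather than an accident.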

answered 2010-09-21T01:50:11.677
2

In the real world, there are usually some guarantees made regarding data integrity. When dealing with user input (whether from stdin or a file), there is also some project-defined convention for notifying the user of a problem that requires attention.

For instance, when a compiler compiling code or a shell executing a script encounters an inconsistency it might halt and print the line containing the inconsistency with a second line below it that uses the "^" symbol to indicate the location of the problem.

So here are some basic questions to ask yourself:
1. Is every line guaranteed to contain every field?
2. Is the ordering of the fields guaranteed?

If those are conditions of the input contract and they are violated, you should ignore or report the line. If they are not conditions of the input, then you need to handle the variation yourself, which you currently do not.
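
As a rough illustration of the compiler-style report described above, something like the following could build the message. ParseErrorFormatter and both method names are invented for this sketch, and it assumes you already know which keyword the line was supposed to start with.

```java
// Hypothetical formatter; the class and method names are illustrative only.
class ParseErrorFormatter {
    // Builds a message, the offending line, and a caret line pointing at the column.
    static String format(int lineNumber, String line, int column, String message) {
        StringBuilder sb = new StringBuilder();
        sb.append("line ").append(lineNumber).append(": ").append(message).append('\n');
        sb.append(line).append('\n');
        for (int i = 0; i < column; i++) {
            sb.append(' ');
        }
        sb.append('^');
        return sb.toString();
    }

    // Returns the first index at which the line stops matching the expected prefix.
    static int firstMismatch(String line, String expectedPrefix) {
        int n = Math.min(line.length(), expectedPrefix.length());
        for (int i = 0; i < n; i++) {
            if (line.charAt(i) != expectedPrefix.charAt(i)) {
                return i;
            }
        }
        return n;
    }
}
```

For a line such as TILET="HELLO WORLD" checked against the prefix TITLE=", firstMismatch returns 2, so the caret lands under the third character.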

answered 2010-09-21T01:53:41.087
2

You are assuming that the input is perfect. Based on a quick read of your algorithm, the data would look like this:

ARTIST="John"
TITLE="HELLO WORLD"
LYRICS="Sing Song All night long"
"

But consider the case

ARTIST="John"
TITLE="HELLO WORLD"
LYRICS="Sing Song All night long"
"
ARTIST="Peter"
LYRICS="Sing Song All night long"
"

Based on your algorithm, you now have 2 songs characterized as

songList = { Song("John", "HELLO WORLD", "Sing Song All night long"),
             Song("Peter", "HELLO WORLD", "Sing Song All night long") }

With the current algorithm, a field that a record leaves out keeps its value from the previous record, so the title "HELLO WORLD" shows up in the 2nd song even though it was never defined there. You need to reset your three variables at the start of each song.

In your else you are just dumping the complete line into lyrics. What if you had already pulled the lyrics out? You are now corrupting them by appending an unrelated line. Test case:

 ARTIST="John"
 LYRICS="Sing Song All night long"
 TILET="HELLO WORLD"
 "

Consider sending such a record to an error state, so that when the batch read is completed an error report can be generated and the bad records fixed.

Also, you only check for EOF before starting a new song. What if EOF occurs in the middle of a record and the file does not end in "? You are going to get an exception there, because nextLine() throws NoSuchElementException when no line is left. In your do/while, add another check for hasNextLine().
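
A minimal sketch that combines these points (reset per record, an error list instead of silent corruption, and a hasNextLine() check in the inner loop) could look like this. SongReader, songs, errors and unquote are names I made up, and a song is stored as a plain {artist, title, lyrics} array rather than your Song class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical sketch; SongReader and its members are illustrative names.
class SongReader {
    final List<String[]> songs = new ArrayList<>();  // each entry: {artist, title, lyrics}
    final List<String> errors = new ArrayList<>();   // one message per rejected record

    void read(Scanner in) {
        int lineNo = 0;
        while (in.hasNextLine()) {
            // Fresh state for every record, so nothing leaks from the previous song.
            String artist = null;
            String title = null;
            StringBuilder lyrics = null;
            String problem = null;
            boolean terminated = false;
            int firstLine = lineNo + 1;

            // hasNextLine() is checked here too, so a truncated file cannot throw.
            while (!terminated && in.hasNextLine()) {
                String line = in.nextLine();
                lineNo++;
                if (line.equals("\"")) {
                    terminated = true;
                } else if (line.startsWith("ARTIST=\"")) {
                    artist = unquote(line, "ARTIST=\"");
                } else if (line.startsWith("TITLE=\"")) {
                    title = unquote(line, "TITLE=\"");
                } else if (line.startsWith("LYRICS=\"")) {
                    lyrics = new StringBuilder(unquote(line, "LYRICS=\""));
                } else if (lyrics != null) {
                    lyrics.append('\n').append(line);  // continuation of the lyrics
                } else if (problem == null) {
                    problem = "unexpected line " + lineNo + ": " + line;
                }
            }

            if (problem == null && !terminated) {
                problem = "missing closing quote";
            }
            if (problem == null && (artist == null || title == null || lyrics == null)) {
                problem = "missing field";
            }
            if (problem == null) {
                songs.add(new String[] { artist, title, lyrics.toString() });
            } else {
                errors.add("record starting at line " + firstLine + ": " + problem);
            }
        }
    }

    // Strips the known prefix and one trailing double quote, if present.
    static String unquote(String line, String prefix) {
        String rest = line.substring(prefix.length());
        return rest.endsWith("\"") ? rest.substring(0, rest.length() - 1) : rest;
    }
}
```

Fed the John/Peter example above, this keeps the first song and reports the second as a record with a missing field, rather than giving Peter the title "HELLO WORLD". Whether a missing title should reject the song or just leave it blank is a policy choice for your project.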

answered 2010-09-21T02:05:45.723
1

I see a couple of things that are missing here, Jason.

I think the if/else was fine, and switching to sequential ifs won't change the logic. However, you should restrict the scope of your variables as much as possible. If you declare artist, title, etc. inside the while loop and initialize them to null (or whatever), an entry that is missing the artist won't get the last entry's value.

Also, what happens if the title, artist, etc. has a quote in it? How is that handled? And what about the lyrics, which seem to span multiple lines?

What happens if there is an unknown field -- maybe a misspelling? It will be added to the end of Lyrics which doesn't seem right. Only once the LYRICS field has been found should you append to it. If lyrics is null then it will start with "null".
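
On the embedded-quote question: split("\"")[1] stops at the first inner quote. One possible alternative, sketched here with a made-up helper name and assuming the whole value sits on one line, is to take everything between the first and the last double quote.

```java
// Hypothetical helper; QuotedField and between are illustrative names.
class QuotedField {
    // Returns the text between the first and the last double quote on the line,
    // or null when the line contains fewer than two quotes.
    static String between(String line) {
        int first = line.indexOf('"');
        int last = line.lastIndexOf('"');
        if (first < 0 || last <= first) {
            return null;
        }
        return line.substring(first + 1, last);
    }
}
```

For the line TITLE="Say "Hello"" this yields Say "Hello", whereas the split version yields only Say followed by a space. It does not help with multi-line lyrics, where the closing quote is on a later line.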

answered 2010-09-21T02:01:18.973
0

Here are some issues that could be addressed:

  • Your code assumes that there is no whitespace before (for example) "ARTIST", none around the "=" sign and so on.

  • Your code assumes that the keywords are in all-caps. Someone could use lowercase or mixed case.

  • Your code assumes that a line that does not start with keyword=\" is a continuation of the song's lyrics. But what if the user entered ARTOST="Sting"? Or what if the user tried to use two lines for an artist name?
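
To make the three points above concrete, here is one way a more tolerant field matcher could be sketched with java.util.regex. FieldLine, parse and isKnown are hypothetical names, and the pattern is only my guess at what "tolerant" should mean for this file format.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the class, methods, and pattern are assumptions, not the asker's code.
class FieldLine {
    // Optional leading whitespace, a keyword, optional whitespace around '=',
    // an opening quote, the value, and an optional closing quote.
    private static final Pattern FIELD =
        Pattern.compile("^\\s*([A-Za-z]+)\\s*=\\s*\"(.*?)\"?\\s*$");

    // Returns {KEYWORD, value} with the keyword upper-cased, or null if the line is not a field.
    static String[] parse(String line) {
        Matcher m = FIELD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1).toUpperCase(Locale.ROOT), m.group(2) };
    }

    // Lets the caller tell a misspelled keyword apart from a real one.
    static boolean isKnown(String keyword) {
        return keyword.equals("ARTIST") || keyword.equals("TITLE") || keyword.equals("LYRICS");
    }
}
```

With this, a line like ARTOST="Sting" parses as a field whose keyword is not known, so it can be reported instead of being appended to the lyrics. The trade-off is that a lyric line which happens to look like word = "text" would also be treated as a field, so you would still need a rule for that case.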

Finally, I'm not convinced that replacing "else if" with "if" in this case has made any difference to the code's robustness.

answered 2010-09-21T02:00:24.230
0

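Deal with exceptions: new Scanner(File) can throw FileNotFoundException, and nextLine() throws NoSuchElementException when no line is left (InputMismatchException only comes from the typed readers such as nextInt()).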

It looks like the do { } while (...) never checks for the end of the file: if the file is ill-formed and ends before a closing quote line, the loop keeps calling nextLine() past the end and throws.

Nothing prevents artist or title from being empty.
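
For the file-opening part, a small sketch with try-with-resources might look like this. SafeOpen and readLines are invented names, and returning an empty list on failure is just one possible policy.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper; the names and the "empty list on failure" policy are assumptions.
class SafeOpen {
    // Returns all lines of the file, or an empty list if it cannot be opened.
    static List<String> readLines(File source) {
        List<String> lines = new ArrayList<>();
        try (Scanner in = new Scanner(source)) {   // the Scanner is closed automatically
            while (in.hasNextLine()) {
                lines.add(in.nextLine());
            }
        } catch (FileNotFoundException e) {
            System.err.println("cannot read " + source + ": " + e.getMessage());
        }
        return lines;
    }
}
```

Because nextLine() is only called after hasNextLine() returns true, the NoSuchElementException case cannot occur in this loop.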

answered 2010-09-21T02:33:07.640