I have a CSV file like this that I have to parse with Java.

2012-11-01 00,  1106,   2194.1971066908
2012-11-01 01,  760,    1271.8460526316
.
.
.
2012-11-30 21,  1353,   1464.0014781966
2012-11-30 22,  1810,   1338.8331491713
2012-11-30 23,  1537,   1222.7826935589
        
720 rows selected.      
        
Elapsed: 00:37:00.23

This is the Java code I wrote to split each row into columns and store them in lists.

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();
                
                while (readBuff!=null){
                    
                    Pattern checkData = Pattern.compile("[a-zA-Z]");
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (match.find()){
                        readBuff = null;
                    }
                    
                    else if (!match.find()){
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }

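The problem is that my regular expression is not quite right: the text "720 rows selected." still gets through and is stored in dHourList.
dHourList should only store the date column, i.e. values like "2012-11-01 00...etc"
throughputList = "1106, 760 ...etc"
avgRespTimeList = "2194.192, 1271.846...etc"

What should the correct regular expression be?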

Update

2012-11-30 21 2012-11-30 22 2012-11-30 23

720 rows selected.

Elapsed: 00:37:00.23

Size of date-hour: 724 size of throughput: 720 size of avg resp time: 720

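I used the following for the checkData regex. The backslashes are doubled because with a single backslash (\d) the compiler reports an invalid escape sequence: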

Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$"); 

But the output still shows "720 rows selected." and another line that should not be there.
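
A likely cause, sketched below: in the pattern above, \2 and \b are the only escapes that were not doubled. Java accepts both inside a string literal (\2 is an octal escape, \b is backspace), so the compiler does not complain, but the regex engine then receives two control characters instead of a backreference and a word boundary. The class and method names here are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // As in the update above: \2 and \b are NOT doubled, so the compiler turns them
    // into the control characters U+0002 and U+0008 before the regex engine sees them.
    static final Pattern BROKEN = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$");

    // Every backslash doubled: the regex engine receives \2 (backreference)
    // and \b (word boundary).
    static final Pattern FIXED = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    static boolean isDataRow(Pattern pattern, String line) {
        return pattern.matcher(line).find();
    }

    public static void main(String[] args) {
        String row = "2012-11-01 00,  1106,   2194.1971066908";
        System.out.println((int) "\2".charAt(0));                    // 2
        System.out.println((int) "\b".charAt(0));                    // 8
        System.out.println(isDataRow(BROKEN, row));                  // false
        System.out.println(isDataRow(FIXED, row));                   // true
        System.out.println(isDataRow(FIXED, "720 rows selected."));  // false
    }
}
```

With the broken pattern no data row can match, because a real row never contains U+0002 or U+0008.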

Update 2

Working code:

while (readBuff!=null){
                    
                    
                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (!match.find()){
                        readBuff = null;
                    }
                    
                    else{
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }

I removed the else if condition, changed it to a plain else, and used the regex suggested by Cylian. Now I get this output:

2012-11-30 21
2012-11-30 22
2012-11-30 23

Size of date-hour: 720 size of throughput: 720 size of avg resp time: 720

Thanks a lot!

3 Answers

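First, it makes sense to put a ^ at the start of the checkData regex. The expression will then only match at the beginning of the line rather than anywhere in the string, which should also make it faster.

You can also make the regex start with something more date-like (say, four digits and a dash), because in the trailing lines the row count is never followed by a dash.

Maybe something like this: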

Pattern checkData = Pattern.compile("^\\d\\d\\d\\d-");

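If you are sure you will not receive unexpected data, this should be enough. If you want your program to keep working even when the CSV data is malformed, extend the regex to cover the whole line and use matches() instead of find().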
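
A minimal sketch of that stricter variant, assuming every data row looks exactly like the sample (date, two-digit hour, integer count, decimal average). The whole-row pattern and the names RowValidator and isDataRow are illustrative assumptions, not something from the question.

```java
import java.util.regex.Pattern;

class RowValidator {
    // Hypothetical whole-row pattern: "yyyy-MM-dd HH, <integer>, <decimal>".
    static final Pattern ROW = Pattern.compile(
        "\\d{4}-\\d{2}-\\d{2} \\d{2},\\s*\\d+,\\s*\\d+(\\.\\d+)?");

    static boolean isDataRow(String line) {
        // matches() must consume the entire input, so no ^ or $ is needed.
        // trim() drops the trailing spaces that the export adds to some lines.
        return ROW.matcher(line.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDataRow("2012-11-01 00,  1106,   2194.1971066908")); // true
        System.out.println(isDataRow("720 rows selected.      "));                // false
        System.out.println(isDataRow("Elapsed: 00:37:00.23"));                    // false
        System.out.println(isDataRow("        "));                                // false
    }
}
```

Because matches() rejects anything that is not a complete row, blank lines and the footer are filtered out without a separate check.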

answered 2013-01-30T09:35:12.990

Try this [your code, slightly modified]:

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();

                while (readBuff!=null){

                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    Matcher match = checkData.matcher(readBuff);

                    if (!match.find()){
                        readBuff = null;
                    }

                    else { // plain else: a second find() would resume after the first match and fail

                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");

                            for (int x=0; x<splitReadBuffByComma.length; x++){

                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }

                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }
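
One caveat when testing the matcher in both branches of an if / else if: Matcher.find() is stateful. After a successful call, the next call continues searching from the end of the previous match. With an anchored pattern that consumes the whole line, the second call on the same Matcher therefore returns false, which is why a plain else is the safer structure. A small sketch (FindTwiceDemo and findTwice are names invented for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindTwiceDemo {
    static final Pattern DATA = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    // Calls find() twice on the same Matcher and reports both results.
    static boolean[] findTwice(String line) {
        Matcher m = DATA.matcher(line);
        boolean first = m.find();   // searches from the start of the line
        boolean second = m.find();  // resumes after the first match
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] r = findTwice("2012-11-01 00,  1106,   2194.1971066908");
        System.out.println(r[0]); // true
        System.out.println(r[1]); // false
    }
}
```

Storing the result of a single find() call in a boolean, or using if / else, avoids the problem.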

Regex anatomy

# ^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference number 1 «(19|20)»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «19»
#       Match the characters “19” literally «19»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «20»
#       Match the characters “20” literally «20»
# Match a single digit 0..9 «\d»
# Match a single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «([-/.])»
#    Match a single character present in the list “-/.” «[-/.]»
# Match the regular expression below and capture its match into backreference number 3 «(0[1-9]|1[012])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «1[012]»
#       Match the character “1” literally «1»
#       Match a single character present in the list “012” «[012]»
# Match the same text as most recently matched by capturing group number 2 «\2»
# Match the regular expression below and capture its match into backreference number 4 «(0[1-9]|[12][0-9]|3[01])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[12][0-9]»
#       Match a single character present in the list “12” «[12]»
#       Match a single character in the range between “0” and “9” «[0-9]»
#    Or match regular expression number 3 below (the entire group fails if this one fails to match) «3[01]»
#       Match the character “3” literally «3»
#       Match a single character present in the list “01” «[01]»
# Assert position at a word boundary «\b»
# Match any single character that is not a line break character «.+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»

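Update

As I understand it, your input contains a number of lines that start with a date but contain no commas. To exclude those, change the previous pattern to the following: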

^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\s+\d+,[^,]+,[^,]+$

Or, escaped for a Java string literal:

^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$
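
As a quick check of the stricter pattern, the sketch below runs it over a few sample lines. The helper keepDataRows and the class name are assumptions made for this illustration; the pattern itself is the escaped one given above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class StrictPatternDemo {
    static final Pattern STRICT = Pattern.compile(
        "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$");

    // Keeps only the lines that have a date, an hour, and exactly two more comma-separated fields.
    static List<String> keepDataRows(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (STRICT.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2012-11-30 23,  1537,   1222.7826935589",
            "2012-11-30 21",          // starts with a date but has no commas
            "720 rows selected.",
            "Elapsed: 00:37:00.23");
        System.out.println(keepDataRows(lines).size()); // 1
    }
}
```
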
answered 2013-01-30T09:39:23.137

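You don't have to do this with a regex (provided the file looks like your example).

You can check

  • whether the line contains a comma ",", or

  • whether the split array has length 3, or

  • change the while condition slightly so that the loop stops when a line ends with "selected.".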
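
These checks can be combined in a short sketch. It assumes the file always has the layout shown in the question; the list names follow the question, while the class CsvColumns and its load and parse methods are hypothetical. It reads from any BufferedReader so it can be tried on a string; for a real file, wrap a FileReader instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class CsvColumns {
    final List<String> dHourList = new ArrayList<>();
    final List<String> throughputList = new ArrayList<>();
    final List<String> avgRespTimeList = new ArrayList<>();

    void load(BufferedReader reader) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            String trimmed = line.trim();
            if (trimmed.endsWith("selected.")) {
                break; // footer reached ("720 rows selected."), nothing useful follows
            }
            String[] parts = trimmed.split(",");
            if (parts.length != 3) {
                continue; // blank or otherwise incomplete line
            }
            dHourList.add(parts[0].trim());
            throughputList.add(parts[1].trim());
            avgRespTimeList.add(parts[2].trim());
        }
    }

    // Convenience wrapper for trying the parser on in-memory text.
    static CsvColumns parse(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            CsvColumns columns = new CsvColumns();
            columns.load(reader);
            return columns;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CsvColumns c = parse("2012-11-01 00,  1106,   2194.1971066908\n"
            + "2012-11-01 01,  760,    1271.8460526316\n"
            + "        \n720 rows selected.      \n        \nElapsed: 00:37:00.23\n");
        System.out.println(c.dHourList);      // [2012-11-01 00, 2012-11-01 01]
        System.out.println(c.throughputList); // [1106, 760]
    }
}
```

The fields are trimmed before being stored, so the padding spaces after each comma do not end up in the lists.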

answered 2013-01-30T09:41:44.487