0

I wrote this code http://paste.ubuntu.com/5730390/ and I'm trying to extract titles which contain 3 or more a's (uppercase or lowercase), or α's (the Greek letter), from some websites. I have already stored the websites' content on a local HDD in txt format (there is a large number of websites).

My input in DFS looks like: site_1.txt, site_2.txt, site_3.txt etc.

Suppose that the titles below belong to site_1.txt, site_2.txt and site_3.txt respectively.

  1. Academia.edu - Share research

  2. Google

  3. News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα

Now I want the output to contain titles 1 and 3 (3 because some of the websites are Greek and this title contains the letter "α"), in a form like:

Academia.edu - Share research, site_1.txt

News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα, site_3.txt

I tried a regex pattern like "?:[αa{3,}]).(?:[αa{3}]).", but there are no results. Could anyone help with that?

Thanks in advance!

4

3 Answers

2

To match 3 a's or alphas, not necessarily adjacent to each other, you can use this regex:

(?:[αa].*){3}
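
A minimal sketch of how this regex could be applied to the three titles from the question. The class name `ThreeAs` and the helper `hasThreeAs` are hypothetical, and the `CASE_INSENSITIVE | UNICODE_CASE` flags are an added assumption (not part of the regex above) so that uppercase "A" and "Α" are counted too, as the question asks:

```java
import java.util.regex.Pattern;

public class ThreeAs {
    // Regex from the answer above. The flags are an assumption added here
    // so that 'A' and capital alpha also count toward the three occurrences.
    static final Pattern THREE_AS = Pattern.compile(
            "(?:[αa].*){3}",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: true if the title has at least three a's/alphas.
    static boolean hasThreeAs(String title) {
        // find() searches anywhere in the title; matches() would additionally
        // require the whole title to start with 'a' or 'α'.
        return THREE_AS.matcher(title).find();
    }

    public static void main(String[] args) {
        String[] titles = {
                "Academia.edu - Share research",
                "Google",
                "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"};
        for (String title : titles) {
            System.out.println(hasThreeAs(title) + "  " + title);
        }
    }
}
```

Note that `{3}` is enough here: once three occurrences are found, any further ones are simply ignored by `find()`.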
answered 2013-06-03T20:09:49.493
1

This doesn't sound like a Hadoop question, just a regex question. You only need to match a or alpha 3 or more times. The regex below does the trick: "([aα].*){3,}"

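import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample titles from the question
String[] titles = {
        "Academia.edu - Share research",
        "Google",
        "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"};
// Three or more occurrences of 'a' or 'α', not necessarily adjacent
String regexPattern = "([aα].*){3,}";
Pattern pattern = Pattern.compile(regexPattern);
for (String title : titles) {
    Matcher matcher = pattern.matcher(title);
    // find() succeeds if the pattern occurs anywhere in the title
    if (matcher.find()) {
        System.out.println("title matched '" + title + "'");
    }
}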
answered 2013-06-03T20:14:57.997
0

You can use replace to achieve this:

// Counts occurrences of c in str by comparing lengths before and after removing c
public static int howMany(String str, char c) {
    String str2 = str.replace(String.valueOf(c), "");
    return str.length() - str2.length();
}

Then you can use the method above:

for (String website : websites) {
    // Keep titles with at least three 'a' or at least three 'α'
    if (howMany(website, 'a') >= 3 || howMany(website, 'α') >= 3) {
        System.out.println(website);
    }
}
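
As a rough, self-contained sketch, the counting approach could also produce the "title, file" output the question asks for. Everything beyond `howMany` is an assumption made for illustration: the class name `TitleFilter`, the helper `countAs`, the map from file name to title, and the choice to lowercase the title first and to count a's and α's together. An accented "ά" is a different character and is not counted here.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class TitleFilter {
    // Same idea as howMany above: length difference after removing c.
    static int howMany(String str, char c) {
        return str.length() - str.replace(String.valueOf(c), "").length();
    }

    // Hypothetical helper: counts 'a' and 'α' together, ignoring case.
    static int countAs(String title) {
        String lower = title.toLowerCase(Locale.ROOT);
        return howMany(lower, 'a') + howMany(lower, 'α');
    }

    public static void main(String[] args) {
        // Hypothetical input: titles keyed by the file they came from.
        Map<String, String> titlesByFile = new LinkedHashMap<>();
        titlesByFile.put("site_1.txt", "Academia.edu - Share research");
        titlesByFile.put("site_2.txt", "Google");
        titlesByFile.put("site_3.txt",
                "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα");

        for (Map.Entry<String, String> entry : titlesByFile.entrySet()) {
            if (countAs(entry.getValue()) >= 3) {
                // Output form requested in the question: "title, file"
                System.out.println(entry.getValue() + ", " + entry.getKey());
            }
        }
    }
}
```

Whether the two letters should be counted together or separately depends on what the question intends; the loop in the answer above counts them separately.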
answered 2013-06-03T20:18:24.247