java - How to extract the title which contains 3 or more a's?

Question

I wrote this code (http://paste.ubuntu.com/5730390/) and I'm trying to extract titles that contain 3 or more a's (uppercase or lowercase), as well as α's (the Greek letter), from some websites. I have already stored the websites' content on a local hard disk in txt format (there is a large number of websites).

My input in dfs is like: site_1.txt, site_2.txt, site_3.txt etc.

Suppose that the titles below belong to site_1.txt, site_2.txt and site_3.txt respectively.

Academia.edu - Share research
Google
News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα

Now I want the output to contain titles 1 and 3 (title 3 because some of the websites are Greek and it contains the letter "α"), in a form like:

Academia.edu - Share research, site_1.txt

News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα, site_3.txt

I tried a regex pattern like "?:[αa{3,}]).(?:[αa{3}]).", but there are no results. Could anyone help with that?

Thanks in advance!

score 2 · Accepted Answer

To match 3 a's or alphas, not necessarily adjacent to each other, you can use this regex:

(?:[αa].*){3}
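
As a minimal sketch of how this pattern might be applied in Java: the class name `TitleFilter` and the helper `hasThreeAs` are hypothetical and do not come from the question's code. It uses `find()` rather than `matches()`, since the pattern only needs to match somewhere inside the title. The `CASE_INSENSITIVE` and `UNICODE_CASE` flags are an added assumption so that `A` and `Α` are counted too, as the question asks.

```java
import java.util.regex.Pattern;

public class TitleFilter {
    // CASE_INSENSITIVE alone only folds ASCII letters;
    // UNICODE_CASE extends the folding to Greek letters such as Α/α.
    private static final Pattern THREE_AS = Pattern.compile(
            "(?:[αa].*){3}", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    // Hypothetical helper: true if the title contains at least three a/α letters.
    public static boolean hasThreeAs(String title) {
        return THREE_AS.matcher(title).find();
    }

    public static void main(String[] args) {
        String[] titles = {
            "Academia.edu - Share research",
            "Google",
            "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"
        };
        for (String title : titles) {
            if (hasThreeAs(title)) {
                System.out.println(title);
            }
        }
    }
}
```

Note that accented letters such as ά are distinct characters and are not matched by this character class.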

score 1 · Accepted Answer

This doesn't sound like a Hadoop problem, just a regex problem. You only need to match a or alpha 3 or more times. The following regex does the job: "([aα].*){3,}".

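import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sample titles standing in for the contents of the site files
String[] files = {
        "Academia.edu - Share research",
        "Google",
        "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"};
// An a or α followed by anything, repeated at least 3 times
String regexpattern = "([aα].*){3,}";
Pattern pattern = Pattern.compile(regexpattern);
for (String file : files) {
    Matcher matcher = pattern.matcher(file);
    // find() succeeds if the pattern matches anywhere in the title;
    // one check per title is enough
    if (matcher.find()) {
        System.out.println("file name matched '" + file + "'");
    }
}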

score 0 · Accepted Answer

You can use replace to achieve this:

public static int howMany(String str, char c) {
    String str2 = str.replace(c+"", "");
    return str.length() - str2.length();
}

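Then you can use the method above:

for (String website : websites) {
    // Each count must be compared against 3; an int is not a boolean in Java
    if (howMany(website, 'a') >= 3 || howMany(website, 'α') >= 3) {
        System.out.println(website);
    }
}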
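
The counting approach above checks each letter separately and is case-sensitive. If the requirement is instead that a's and α's together, in either case, total 3 or more (the question's wording allows that reading), one possible variant counts them in a single pass. This is only a sketch under that assumption; `LetterCounter` and `countAs` are hypothetical names.

```java
public class LetterCounter {
    // Hypothetical helper: counts a, A, α and Α by lowercasing each char first.
    public static int countAs(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            char lower = Character.toLowerCase(text.charAt(i));
            if (lower == 'a' || lower == 'α') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] titles = {
            "Academia.edu - Share research",
            "Google",
            "News12.gr | Αθλητική Ενημέρωση από τα Δωδεκάνησα"
        };
        for (String title : titles) {
            // Keep only titles with at least three a/α letters in total
            if (countAs(title) >= 3) {
                System.out.println(title);
            }
        }
    }
}
```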
