java - Detecting word boundaries in text

Question

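I am running into a problem with word-boundary recognition. I have stripped all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). My plan is to take the bigrams and trigrams of the document and check whether each one exists in a dictionary (WordNet). Is there a better way to achieve this?

Below is the sample text. I want to identify the entities (shown in double quotes):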

Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"
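
For reference, the bigram/trigram lookup described above could be sketched roughly as follows. This is only a minimal sketch: the class `NgramLookup`, the method `findKnownNgrams`, and the tiny hand-built dictionary are hypothetical stand-ins, and a real WordNet lookup would have to replace the in-memory `Set`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class NgramLookup {

    // Returns every n-gram (minN <= n <= maxN) of the text that appears in the dictionary.
    // Assumption: the dictionary holds lowercase phrases separated by single spaces.
    static List<String> findKnownNgrams(String text, Set<String> dictionary, int minN, int maxN) {
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            StringBuilder gram = new StringBuilder();
            for (int n = 1; n <= maxN && start + n <= tokens.length; n++) {
                if (n > 1) {
                    gram.append(' ');
                }
                // Strip leading/trailing punctuation so that "\"Star" becomes "Star"
                gram.append(tokens[start + n - 1].replaceAll("^\\W+|\\W+$", ""));
                String key = gram.toString().toLowerCase(Locale.ROOT);
                if (n >= minN && dictionary.contains(key)) {
                    found.add(key);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary; stands in for a WordNet lookup
        Set<String> dictionary = Set.of("star trek", "united federation of planets");
        String text = "in the fictional \"Star Trek\" universe and the \"United Federation of Planets\"";
        System.out.println(findKnownNgrams(text, dictionary, 2, 4));
    }
}
```

Note that the trigram limit has to be raised to four words here, otherwise "United Federation of Planets" can never match.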

score 1 · Accepted Answer

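I think what you're describing is really still a subject of emerging research rather than a simple matter of applying well-established algorithms.

I can't give you a simple "do this" answer, but here are some pointers off the top of my head: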

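I think using WordNet could work (though I'm not sure where the bigrams/trigrams come into it), but you should treat WordNet lookups as one part of a hybrid system rather than the be-all and end-all of spotting named entities;
then, start by applying some simple, common-sense criteria (sequences of capitalised words; try to accommodate frequent lowercase function words such as "of" in these; sequences consisting of a "known title" plus capitalised words);
look for sequences of words that, statistically, you wouldn't expect to occur together by chance, and treat them as entity candidates;
can you build in dynamic web lookup? (Your system spots the capitalised sequence "IBM" and checks whether it finds, for example, a Wikipedia entry with the text pattern "IBM is ... [organisation|company|...]".)
see whether anything here and in the "information extraction" literature gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html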
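
As a rough illustration of the capitalised-sequence criterion above, here is a minimal sketch. The class name `CapitalizedSequences`, the method `findCandidates`, and the short list of allowed function words are all assumptions made for the example, not an established recipe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CapitalizedSequences {

    // One capitalised word, optionally followed by more capitalised words.
    // A single lowercase function word (assumed list: of, the, and, for)
    // may sit between two capitalised words.
    private static final Pattern CANDIDATE = Pattern.compile(
        "\\b[A-Z][A-Za-z]*(?:\\s+(?:(?:of|the|and|for)\\s+)?[A-Z][A-Za-z]*)*");

    static List<String> findCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            candidates.add(matcher.group());
        }
        return candidates;
    }

    public static void main(String[] args) {
        String text = "Vulcans evolved on the planet Vulcan and later joined "
            + "the \"United Federation of Planets\"";
        // Prints one candidate per line
        findCandidates(text).forEach(System.out::println);
    }
}
```

A pattern like this also picks up ordinary sentence-initial words (for example "They" in the sample text), so its output is best treated as a candidate list to be filtered by the other criteria, not as a final answer.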

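The truth is that when you look at the literature out there, people don't seem to be using terribly sophisticated, well-established algorithms. So I think there's a lot of room for looking at your data, exploring, and seeing what you can come up with... Good luck!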

score 0 · Accepted Answer

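Someone asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article that uses the density of a word's occurrences to decide how important it is, based on the observation that when a text talks about something, it usually refers to that thing fairly often. The article is interesting because the technique requires no prior knowledge of the text being processed (for example, you don't need a dictionary targeted at a specific vocabulary).

The article proposes two algorithms.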

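The first algorithm rates single words (such as "Federation" or "Trek") according to their measured importance. It is straightforward to implement, and I could even provide a (not very elegant) implementation in Python.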
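
The article's exact formula is not reproduced here, so the following is only a sketch of one simple way to measure how densely a word's occurrences cluster: the standard deviation of the gaps between successive occurrences, divided by the mean gap. The class `WordClustering`, the method `clusteringScores`, and the choice of this particular score are assumptions made for illustration, not the article's method.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class WordClustering {

    // Scores each word by how unevenly its occurrences are spread through the text:
    // (standard deviation of gaps between successive occurrences) / (mean gap).
    // Evenly spaced words score near 0; words that appear in bursts score higher.
    static Map<String, Double> clusteringScores(String text, int minOccurrences) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z]+");
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].isEmpty()) {
                positions.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(i);
            }
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : positions.entrySet()) {
            List<Integer> pos = entry.getValue();
            if (pos.size() < Math.max(minOccurrences, 3)) {
                continue; // at least two gaps are needed for a meaningful spread
            }
            int gaps = pos.size() - 1;
            double mean = (pos.get(pos.size() - 1) - pos.get(0)) / (double) gaps;
            double sumSquares = 0;
            for (int i = 1; i < pos.size(); i++) {
                double d = (pos.get(i) - pos.get(i - 1)) - mean;
                sumSquares += d * d;
            }
            scores.put(entry.getKey(), Math.sqrt(sumSquares / gaps) / mean);
        }
        return scores;
    }

    public static void main(String[] args) {
        String text = "the vulcan the vulcan the vulcan the q the r the s the vulcan";
        System.out.println(clusteringScores(text, 3));
    }
}
```

In the small demo text, "the" is spaced perfectly evenly and "vulcan" arrives in a burst, so "vulcan" gets the higher score even though it is the rarer word.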

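The second algorithm is more interesting, because it extracts noun phrases (such as "Star Trek") by ignoring whitespace entirely and using a tree structure to decide how to split noun phrases. The results it gives when applied to Darwin's seminal text on evolution are very impressive. However, I admit that implementing this algorithm would take more thought, since the description given in the article is fairly elusive, and the authors seem a bit difficult to track down. That said, I didn't spend much time on it, so you may have better luck.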

score 0 · Accepted Answer

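If I understand you correctly, you are looking for substrings delimited by double-quote (") characters. You could use a capturing group in a regular expression: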

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
        " universe who evolved on the planet Vulcan and are noted for their " +
        "attempt to live by reason and logic with no interference from emotion" +
        " They were the first extraterrestrial species officially to make first" +
        " contact with Humans and later became one of the founding members of the" +
        " \"United Federation of Planets\"";
    String[] entities = new String[10];                 // An array to hold matched substrings (at most 10 here)
    Pattern pattern = Pattern.compile("\"(.*?)\"");     // Lazily capture everything between a pair of " characters
    Matcher matcher = pattern.matcher(text);            // The matcher that runs the regex on our text
    int count       = 0;                                // An index for the array of matches
    while (count < entities.length && matcher.find()) { // find() resumes after the previous match and returns false when none is left
        entities[count++] = matcher.group(1);           // Add the match to the array, without its " marks
    }

Or write a small "parser" using string functions:

    int startFrom = text.indexOf('"');                              // The index position of the first " character
    int nextQuote = text.indexOf('"', startFrom+1);                 // The index position of the next " character
    int count = 0;                                                  // An index for the array of matches
    while (startFrom > -1 && nextQuote > -1 && count < entities.length) { // Keep looping while there is a complete pair of " characters and room in the array
        entities[count++] = text.substring(startFrom+1, nextQuote); // Retrieve the substring and add it to the array
        startFrom = text.indexOf('"', nextQuote+1);                 // Find the next opening " character after nextQuote
        nextQuote = (startFrom > -1) ? text.indexOf('"', startFrom+1) : -1; // Find its closing " character, if there is an opening one
    }

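In both cases the sample text is hard-coded for the sake of the example, and the same variable is assumed to exist (a String variable named text).

If you want to test the contents of the entities array: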

    int i = 0;
    while (i < count) {
        System.out.println(entities[i]);
        i++;
    }

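I have to warn you that there may be problems with boundary/edge cases (i.e. when a " character is at the very start or end of the string). There may also be problems if the parity of the " characters is uneven (i.e. if there is an odd number of " characters in the text). You can run a simple parity check beforehand: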

    static int countQuoteChars(String text) {
        int nextQuote = text.indexOf('"');              // Find the first " character
        int count = 0;                                  // A counter for " characters found
        while (nextQuote != -1) {                       // While there is another " character ahead
            count++;                                    // Increase the count by 1
            nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
        }
        return count;                                   // Return the result
    }

    static boolean quoteCharacterParity(int numQuotes) {
        return numQuotes % 2 == 0; // true when the number of " characters is even, false when it is odd
    }

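Note that if numQuotes happens to be 0, this method still returns true (because 0 modulo any non-zero number is 0, so (numQuotes % 2 == 0) will be true), even though you would not want to go ahead with parsing when there are no " characters, so you will want to check for that case somewhere.

Hope this helps!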
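
One way to fold the zero-quote case into the parity check is sketched below. The names `QuoteGuard` and `hasBalancedQuotes` are hypothetical and only serve the example; the method counts the " characters itself so that it can be used on its own.

```java
final class QuoteGuard {

    // True only when the text contains at least one " character
    // and the total number of " characters is even.
    static boolean hasBalancedQuotes(String text) {
        int count = 0;
        for (int i = text.indexOf('"'); i != -1; i = text.indexOf('"', i + 1)) {
            count++;
        }
        return count > 0 && count % 2 == 0;
    }

    public static void main(String[] args) {
        System.out.println(hasBalancedQuotes("the \"Star Trek\" universe")); // true
        System.out.println(hasBalancedQuotes("no quotes at all"));           // false
        System.out.println(hasBalancedQuotes("one stray \" character"));     // false
    }
}
```

Calling a guard like this before either extraction loop means the loops only ever see text with complete pairs of " characters.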
