16

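I am parsing emails. When I see a reply to an email, I want to remove the quoted text so that I can append the new text to the previous email (even if that one is itself a reply).

Typically, you will see:

First email (start of the conversation)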

This is the first email

Second email (reply to the first)

This is the second email

Tim said:
This is the first email

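The output for this would be only "This is the second email". Different email clients quote text differently, so a way to get mostly just the new email text would also be acceptable.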

4

6 Answers

15

I use the following regular expressions to match the beginning of the quoted text (the last one is the one that matters):

  /** general spacers for time and date */
  private static final String spacers = "[\\s,/\\.\\-]";

  /** matches times */
  private static final String timePattern  = "(?:[0-2])?[0-9]:[0-5][0-9](?::[0-5][0-9])?(?:(?:\\s)?[AP]M)?";

  /** matches day of the week */
  private static final String dayPattern   = "(?:(?:Mon(?:day)?)|(?:Tue(?:sday)?)|(?:Wed(?:nesday)?)|(?:Thu(?:rsday)?)|(?:Fri(?:day)?)|(?:Sat(?:urday)?)|(?:Sun(?:day)?))";

  /** matches day of the month (number and st, nd, rd, th) */
  private static final String dayOfMonthPattern = "[0-3]?[0-9]" + spacers + "*(?:(?:th)|(?:st)|(?:nd)|(?:rd))?";

  /** matches months (numeric and text) */
  private static final String monthPattern = "(?:(?:Jan(?:uary)?)|(?:Feb(?:ruary)?)|(?:Mar(?:ch)?)|(?:Apr(?:il)?)|(?:May)|(?:Jun(?:e)?)|(?:Jul(?:y)?)" +
                                              "|(?:Aug(?:ust)?)|(?:Sep(?:tember)?)|(?:Oct(?:ober)?)|(?:Nov(?:ember)?)|(?:Dec(?:ember)?)|(?:[0-1]?[0-9]))";

  /** matches years (only 1000's and 2000's, because we are matching emails) */
  private static final String yearPattern  = "(?:[1-2]?[0-9])[0-9][0-9]";

  /** matches a full date */
  private static final String datePattern     = "(?:" + dayPattern + spacers + "+)?(?:(?:" + dayOfMonthPattern + spacers + "+" + monthPattern + ")|" +
                                                "(?:" + monthPattern + spacers + "+" + dayOfMonthPattern + "))" +
                                                 spacers + "+" + yearPattern;

  /** matches a date and time combo (in either order) */
  private static final String dateTimePattern = "(?:" + datePattern + "[\\s,]*(?:(?:at)|(?:@))?\\s*" + timePattern + ")|" +
                                                "(?:" + timePattern + "[\\s,]*(?:on)?\\s*"+ datePattern + ")";

  /** matches a leading line such as
   * ----Original Message----
   * or simply
   * ------------------------
   */
  private static final String leadInLine    = "-+\\s*(?:Original(?:\\sMessage)?)?\\s*-+\n";

  /** matches a header line indicating the date */
  private static final String dateLine    = "(?:(?:date)|(?:sent)|(?:time)):\\s*"+ dateTimePattern + ".*\n";

  /** matches a subject or address line */
  private static final String subjectOrAddressLine    = "(?:(?:from)|(?:subject)|(?:b?cc)|(?:to)):.*\n";

  /** matches gmail style quoted text beginning, i.e.
   * On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:
   */
  private static final String gmailQuotedTextBeginning = "(On\\s+" + dateTimePattern + ".*wrote:\n)";


  /** matches the start of a quoted section of an email */
  private static final Pattern QUOTED_TEXT_BEGINNING = Pattern.compile("(?i)(?:(?:" + leadInLine + ")?" +
                                                                        "(?:(?:" +subjectOrAddressLine + ")|(?:" + dateLine + ")){2,6})|(?:" +
                                                                        gmailQuotedTextBeginning + ")"
                                                                      );

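I know that in some ways this is overkill (and possibly slow!), but it works quite well. Please let me know if you find anything that it does not match, so that I can improve it!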
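To show how such a pattern is typically applied, here is a minimal sketch that cuts the message at the first match. `QUOTE_START` is a deliberately simplified stand-in for `QUOTED_TEXT_BEGINNING` (an assumption for illustration, not the full pattern above), and the names `QuoteTrimmer` and `stripQuoted` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTrimmer {
    // Simplified stand-in (assumption): an "Original Message" separator line,
    // or a Gmail-style "On ... wrote:" line. Case-insensitive, matched per line.
    private static final Pattern QUOTE_START = Pattern.compile(
        "(?im)^(?:-+\\s*Original(?:\\sMessage)?\\s*-+|On\\s.+wrote:)\\s*$");

    /** Returns the text before the first quote marker, or the whole text if none is found. */
    static String stripQuoted(String body) {
        Matcher m = QUOTE_START.matcher(body);
        return m.find() ? body.substring(0, m.start()).trim() : body.trim();
    }

    public static void main(String[] args) {
        String reply = "This is the second email\n\n"
            + "On Mon Jun 7, 2010 at 8:50 PM, Simon wrote:\n"
            + "> This is the first email\n";
        System.out.println(stripQuoted(reply));
    }
}
```

Swapping in the full `QUOTED_TEXT_BEGINNING` pattern should only require changing the `Pattern` used by the matcher. Note that an attribution such as "Tim said:" from the question is not covered by either pattern.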

answered 2010-07-08T04:09:31.270
7

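Take a look at the Google patent: http://www.google.com/patents/US7222299

In summary, they hash parts of the text (probably something like sentences) and then look for matching hashes in previous messages. It is super fast, and they probably also use it as an input to their threading algorithm. What a great idea!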
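The idea can be sketched roughly as follows. This is not the method described in the patent, only an illustration under simple assumptions: lines stand in for "parts of the text", `String.hashCode()` is the hash, and the names `HashedQuoteFilter`, `normalize`, and `newLines` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HashedQuoteFilter {
    /** Drops leading '>' quote markers and whitespace, then lowercases the line. */
    static String normalize(String line) {
        return line.replaceFirst("^[>\\s]+", "").trim().toLowerCase();
    }

    /** Keeps the non-blank lines of reply whose normalized hash does not occur in previous. */
    static List<String> newLines(String previous, String reply) {
        Set<Integer> seen = new HashSet<>();
        for (String line : previous.split("\\r?\\n")) {
            String n = normalize(line);
            if (!n.isEmpty()) seen.add(n.hashCode());
        }
        List<String> fresh = new ArrayList<>();
        for (String line : reply.split("\\r?\\n")) {
            String n = normalize(line);
            // A hash hit means the line was (very likely) already in the earlier message.
            if (!n.isEmpty() && !seen.contains(n.hashCode())) fresh.add(line);
        }
        return fresh;
    }
}
```

With the two emails from the question, this keeps "This is the second email" and the attribution line "Tim said:", since the attribution never appeared in the first message. A 32-bit hash can collide, so a stronger hash would be a sensible choice for real mail volumes.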

answered 2013-08-09T16:16:35.193
2

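When the previous mails are stored on disk, or are otherwise available, you can go through all the mails sent by the specific recipient to determine which part is the response text.

You can also try to determine the quotation character by checking the first character of the last few lines. Usually the last lines all begin with the same character.

When the last two lines begin with different characters, you can try the first lines instead, because sometimes the answer is appended at the end of the text.

Once you have detected such a character, you can delete the trailing lines that begin with it, until you reach an empty line or a line beginning with another character.

Untested, and more like pseudocode: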

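    String[] lines = emailBody.split("\\r?\\n"); // emailBody: the message text (assumed name)

    // The last line must be non-empty and there must be more than two lines
    if (lines.length > 2 && !lines[lines.length - 1].isEmpty()) {
        char startingChar = lines[lines.length - 1].charAt(0);
        int foundCounter = 0;
        for (int i = lines.length - 2; i >= 0; --i) {
            String line = lines[i];

            // Skip empty lines so that charAt(0) cannot throw
            if (!line.isEmpty() && startingChar == line.charAt(0)) {
                ++foundCounter;
            }
        }

        final int YOUR_DECISION = 2; // threshold, you can decide
        if (foundCounter > YOUR_DECISION) {
            deleteLastLinesHere(startingChar, foundCounter); // not defined here
        }
    }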
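The deletion step is left open above. One possible runnable reading of it, assuming the quoted block sits at the very end of the message, is sketched below; `TrailingQuoteRemover`, `dropTrailingQuote`, and `minRun` are hypothetical names, not part of the original pseudocode.

```java
import java.util.Arrays;

class TrailingQuoteRemover {
    /**
     * Drops the trailing run of lines that start with the same character as the
     * last non-empty line, provided that run is longer than minRun lines.
     */
    static String[] dropTrailingQuote(String[] lines, int minRun) {
        int end = lines.length;
        while (end > 0 && lines[end - 1].isEmpty()) end--; // ignore trailing blank lines
        if (end == 0) return lines;

        char quoteChar = lines[end - 1].charAt(0);
        int start = end;
        // Walk upwards until an empty line or a line with a different first character.
        while (start > 0 && !lines[start - 1].isEmpty()
                && lines[start - 1].charAt(0) == quoteChar) {
            start--;
        }
        return (end - start > minRun) ? Arrays.copyOf(lines, start) : lines;
    }
}
```

For example, `dropTrailingQuote(new String[] {"New text", "", "> a", "> b", "> c"}, 2)` returns the first two lines only. The heuristic can misfire when several ordinary trailing lines happen to share a first letter, so a larger `minRun` or a whitelist of quote characters may be worth considering.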
answered 2010-03-05T08:29:59.933
2

The RegEx works fine, except that it starts matching at the subject line and ignores everything before "Subject":

Text
-------- Original Message -------- 
<TABLE border="0" cellpadding="0" cellspacing="0">
  <TBODY>
    <TR>
      <TH align="right" valign="baseline">
      // the matcher starts working from here
answered 2011-04-11T15:45:53.743
1

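By observing how Gmail handles this, I noted their strategy:

  1. Write the complete second email.
  2. Append a line such as: On [timestamp], [first email sender name] <[first email sender email address]> wrote:
  3. Append the complete first email.
     a. If your email is in plain text format, a ">" is added before each line of the first email.
     b. If it is in HTML format, Gmail gives the quoted part a left margin through a blockquote styled like this:

    border-left: 1px solid #CCC; margin: 0px 0px 0px 0.8ex; padding-left: 1ex;

    and then appends the text of the first email.

When parsing emails from Gmail addresses, you can reverse engineer this. I have not investigated other clients, but they should behave similarly.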
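For the plain text case, reversing these steps could look like the sketch below. It assumes the attribution fits on one line that starts with "On " and ends with "wrote:", and that the first quoted line follows it directly; real messages may wrap the attribution or insert a blank line. `GmailReplyParser` and `newText` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class GmailReplyParser {
    /**
     * Returns the lines written by the replier: everything before an attribution
     * line ("On ... wrote:") that is directly followed by a '>'-quoted line.
     */
    static List<String> newText(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i).trim();
            boolean attribution = line.startsWith("On ") && line.endsWith("wrote:");
            boolean quoteFollows = i + 1 < lines.size() && lines.get(i + 1).startsWith(">");
            if (attribution && quoteFollows) break; // step 2 reached: the rest is the first email
            if (!lines.get(i).startsWith(">")) kept.add(lines.get(i));
        }
        return kept;
    }
}
```

Requiring a quoted line right after the attribution keeps an ordinary sentence that merely ends in "wrote:" from cutting the message short.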

answered 2010-03-05T08:33:37.493
1

With just a few lines of code, you can get an almost correct result:

StringBuilder newMessage = new StringBuilder();
for (String line : emailLines) {
  // Skip quoted lines, i.e. those starting with '>'
  if (!line.startsWith(">")) {
    newMessage.append(line).append('\n'); // keep the line breaks
  }
}

If necessary, you can add further regular expression checks for email clients that mark quoted text differently.

answered 2010-03-07T01:43:01.413