
I am working on some extensions to OpenBravoPOS so that it can read the invoices from the company we order our products from.

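The invoice is created as a PDF. I use the iText library to read the individual order lines. The problem is that I can read the page I need, but only as one big string. The string looks like this: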

LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: -  Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        "

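What I am trying to do is read each line and determine whether it is an order line; if it is, I need to put it into the database.

An order line looks like this: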

2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00

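It reads as follows: order line 2, 1 piece of product KR01, description Eye Pencil DECADENCE BLACK, price 6.00.

Is there a simple way to read this long string and split it into the correct order lines?

Thanks for your replies.

My code so far is: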

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/small.pdf";
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal"; 
            if  (str.contains(FindPage)){ 
              out.println(strategy.getResultantText());
        }
        }
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

2 Answers


You can design a regex to solve this problem in a number of different ways. Here is one:

    String pdf = "LEVERINGSBON 30/06/2012 27828/2012/NL/WebShop   Distributeur ID nummer: 15099191 Uw distributeur: Klant Naam: FM Point Marcel Snoeck Adres: Zonnedauw 17 5953MS Reuver Telefoon: +31654317017 E-MAIL: yvonneenmarcel@home.nl Opmerking: - Lp. Rekening Totaal FV/39525/2012/NL     vd Wal Sandra 72.00 1 3 x 354 - Luxury Collection 50ml NEW! 72.00 FV/39526/2012/NL     Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 FV/39527/2012/NL     Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75 4 2 x E016 -Tapijtreiniger 1000ml 9.20 5 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39528/2012/NL     Nabben Lieke 32.00 6 1 x 192 - Luxury Collection 50ml 21.00 7 1 x 3 Step Mascara PERFECT BLACK 11.00 FV/39529/2012/NL     Claessens Patrick 12.40 8 1 x P101 - Peeling VERBENA 12.40 FV/39530/2012/NL     Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 10 2 x B023 - Body Lotion 200ml NEW 18.40 11 2 x 023 - Classic Collection 30ml 30.60 FV/39531/2012/NL     van Pol-Thijssen Silvia 34.70 12 1 x 110 - Classic Collection 50ml 15.30 13 1 x N003 - Nagellak HOT RED 7.00 14 1 x P103 - Peeling CHERRY BLOSSOM 12.40 Aantal: 21 Totaal: 258.05 € 1.17.4564.29482 1/1        ";
    String patternString = "\\d\\s\\d\\sx.*?\\d\\.\\d\\d";
    Matcher matcher = Pattern.compile(patternString).matcher(pdf);
    List<String> dataRows = new ArrayList<String>();
    while (matcher.find()) {
        dataRows.add(matcher.group());
    }
    System.out.println(dataRows);

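Explanation of the regex:
\\d\\s\\d\\sx : matches a digit, whitespace, a digit, whitespace, 'x'
.*? : matches any number of any characters, but non-greedily, so the match stops at the first price that follows instead of running on to the last price in the string
\\d\\.\\d\\d : matches the final number with two decimal places
This may need adjusting as the data varies (for example, \\d matches a single digit only, so line numbers of 10 or more need \\d+), but it should be a good starting point.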
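To illustrate why the non-greedy quantifier matters, here is a minimal sketch that runs the pattern above, and a greedy variant of it, over a shortened excerpt of the invoice text. The class name `LazyVsGreedyDemo` and the helper `findAll` are made up for this illustration, and the excerpt assumes ordinary spaces as in the pasted string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original answer.
final class LazyVsGreedyDemo {

    /** Returns every substring of text that the given regex matches. */
    static List<String> findAll(String regex, String text) {
        List<String> hits = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shortened excerpt of the invoice text from the question.
        String text = "Slaats Tim 6.00 2 1 x KR01 - Eye Pencil DECADENCE BLACK 6.00 "
                + "FV/39527/2012/NL Nabben Britt 44.95 3 3 x E013 - Krachtreiniger 1000ml 24.75";

        // Lazy (.*?): each match stops at the first price after the "x".
        System.out.println(findAll("\\d\\s\\d\\sx.*?\\d\\.\\d\\d", text));

        // Greedy (.*): a single match that runs on to the last price in the text.
        System.out.println(findAll("\\d\\s\\d\\sx.*\\d\\.\\d\\d", text));
    }
}
```

With the lazy quantifier each order line comes out as its own match; with the greedy one the two order lines are swallowed into a single match.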

If you need a list of a custom data structure rather than Strings, you can get at the individual parts of a match like this:

...
String patternString = "(\\d)\\s(\\d)\\sx.*?\\d\\.\\d\\d";
...
List<MyDataObj> dataRows = new ArrayList<MyDataObj>();
while (matcher.find()) {
    MyDataObj m = new MyDataObj();
    m.setSomeField(matcher.group(1));    // first captured value
    m.setAnotherField(matcher.group(2)); // second captured value
    dataRows.add(m);
}

Just wrap each value you want to keep in parentheses in the pattern, then retrieve them with matcher.group(1), matcher.group(2), and so on (matcher.group(0) gives you the entire match).
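As a rough sketch of the capture-group approach with typed fields, the following turns each match into a small value object. `OrderLine` and `OrderLineParser` are hypothetical names standing in for MyDataObj, the pattern uses \\d+ so that line numbers of 10 or more are accepted, and it assumes the text contains ordinary spaces as in the pasted string.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value class standing in for MyDataObj. */
final class OrderLine {
    final int lineNumber;
    final int quantity;
    final String product;
    final BigDecimal price;

    OrderLine(int lineNumber, int quantity, String product, BigDecimal price) {
        this.lineNumber = lineNumber;
        this.quantity = quantity;
        this.product = product;
        this.price = price;
    }

    @Override
    public String toString() {
        return lineNumber + ": " + quantity + " x " + product + " @ " + price;
    }
}

/** Hypothetical parser; a sketch, not the original answer's code. */
final class OrderLineParser {
    // Groups: 1 = line number, 2 = quantity, 3 = product text, 4 = price.
    private static final Pattern ORDER_LINE =
            Pattern.compile("(\\d+)\\s(\\d+)\\sx\\s(.*?)\\s(\\d+\\.\\d{2})");

    /** Converts every order line found in text into an OrderLine. */
    static List<OrderLine> parse(String text) {
        List<OrderLine> lines = new ArrayList<OrderLine>();
        Matcher m = ORDER_LINE.matcher(text);
        while (m.find()) {
            lines.add(new OrderLine(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    m.group(3),
                    new BigDecimal(m.group(4))));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Excerpt of the invoice text from the question.
        String text = "Smits Yolanda 56.00 9 1 x E006 - Wasmiddel VIVID COLOURS 1000ml 7.00 "
                + "10 2 x B023 - Body Lotion 200ml NEW 18.40";
        for (OrderLine line : parse(text)) {
            System.out.println(line);
        }
    }
}
```

Keeping the price as a BigDecimal rather than a double avoids rounding surprises when the values are later summed or written to the database.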

Answered 2012-07-09T11:01:28.323

The answer worked well. It resulted in the following code:

/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */
package part4.chapter15;

import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/big.pdf" ;
    /** The resulting text file. */
    public static final String RESULT = "C:/Users/marcel/Documents/FM/NL/FMPoint/Kassa_voorraad_software/PDF-Itext/PDF_Results_Import_Files/sample-result.txt" ;

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {

        /** Putting result in Array, to be able extract to Table */
        PdfArray array;

        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            String str = strategy.getResultantText();
            CharSequence FindPage = "Lp. Rekening Totaal"; 
            if  (str.contains(FindPage)){ 
/*                Pattern re =  Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(.*?)(\\d+\\.\\d{2})"); */
                /* Pattern for orders of Artikels with product Code */
                Pattern re2 =  Pattern.compile("(\\d+)\\s(\\d+)(\\xA0)x(\\xA0)(\\w+)(\\xA0)-\\s(.*?)(\\d+\\.\\d{2})"); 
                Matcher m = re2.matcher(str);
                int mIdx = 0;
                while (m.find()){
                    for( int groupIdx = 0; groupIdx < m.groupCount()+1; groupIdx++ ){
                        /*System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));*/
                        System.out.println( "[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
                    }
                    mIdx++;
                }

                /**     System.out.println(dataRows); */

                out.println(strategy.getResultantText());
            }
        }
        out.flush();
        out.close();
    }

    /**
     * Main method.
     * @param    args    no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }

}

The output looks like this:

Complete order line [0][0] = 4 3 x 023 - Classic Collection 30ml 45.90

Line number [0][1] = 4

Quantity [0][2] = 3

[0][3] =  

[0][4] =  

Product code [0][5] = 023

[0][6] =  

Product description [0][7] = Classic Collection 30ml

Price [0][8] = 45.90

[1][0] = 5 2 x C052 - Hand and Nail Cream 100ml NEW 15.20

[1][1] = 5

[1][2] = 2

[1][3] =  

[1][4] =  

[1][5] = C052

[1][6] =  

[1][7] = Hand and Nail Cream 100ml NEW

[1][8] = 15.20
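The blank groups 3, 4 and 6 in this output hold the non-breaking spaces (\xA0) that the pattern captures. One possible variation, shown here only as a sketch, is to replace the non-breaking spaces with ordinary spaces first and capture just the five useful fields. The class name `NbspOrderLines` and the method `extract` are made up for this illustration, and like the pattern above it only matches order lines that carry a product code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a variation on the pattern above, not the code actually used.
final class NbspOrderLines {
    // Groups: 1 = line number, 2 = quantity, 3 = product code,
    //         4 = description, 5 = price.
    private static final Pattern ORDER_LINE = Pattern.compile(
            "(\\d+)\\s(\\d+)\\sx\\s(\\w+)\\s-\\s(.*?)\\s(\\d+\\.\\d{2})");

    /** Returns one String[5] per order line that has a product code. */
    static List<String[]> extract(String pageText) {
        // Turn non-breaking spaces into ordinary ones so that \s matches them.
        String normalized = pageText.replace('\u00A0', ' ');
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = ORDER_LINE.matcher(normalized);
        while (m.find()) {
            rows.add(new String[] {
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Sample text with non-breaking spaces around the "x" and after the code.
        String page = "4 3\u00A0x\u00A0023\u00A0- Classic Collection 30ml 45.90";
        for (String[] row : extract(page)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Each row then maps directly onto the columns needed for the database insert, without empty groups to skip.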

Thanks for the great support.

Answered 2012-07-11T09:49:25.157