oracle - How do I extract text from a PDF while preserving the original formatting (using CTX_DOC)?

Translated from: https://stackoverflow.com/questions/30962654 2015-06-21T08:31:38.347

Viewed 648 times

I am using this code to filter the text out of a PDF file:

create or replace directory pdf_dir as '&1';

create or replace directory l_curr_dir as '&3';

declare
  ll_clob     CLOB;
  l_bfile     BFILE;
  l_filename  VARCHAR2(200) := '&2';
begin
  -- Drop leftovers from a previous run; ignore errors if they do not exist.
  begin
    ctx_ddl.drop_preference('testfilter');
    ctx_ddl.drop_policy('test_policy1');
  exception when others then
    null;
  end;

  -- The same policy name must be used for drop, create, and filter.
  ctx_ddl.create_preference('testfilter', 'AUTO_FILTER');
  ctx_ddl.create_policy('test_policy1', 'testfilter');

  l_bfile := bfilename('PDF_DIR', l_filename);

  dbms_lob.fileopen(l_bfile);

  -- Filter the PDF into plain text stored in ll_clob.
  ctx_doc.policy_filter(
      policy_name => 'test_policy1'
    , document    => l_bfile
    , restab      => ll_clob
    , plaintext   => true
    , charset     => 'US7ASCII'
  );

  dbms_lob.fileclose(l_bfile);

  -- Write the filtered text to the output file.
  dbms_xslprocessor.clob2file(ll_clob, 'L_CURR_DIR', '&4');
end;
/

This solution works well for me, but is there any way to get tabular data? At the moment the text is filtered phrase by phrase, or line by line.

For example, if the PDF contains the following values:

Name:            Amount  
Pradeep          100 USD

I want the output exactly as it appears, but the current setup gives the following output:

Name:
Amount
Pradeep
100 USD

Is there any way to get the text in the PDF's original layout?

Is it possible to change the filter?
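
One variation that may be worth trying is sketched below. It is not a confirmed fix. The `plaintext` parameter of `ctx_doc.policy_filter` can be set to `false`, in which case AUTO_FILTER produces HTML rather than plain text. Whether that HTML keeps the column layout of a particular PDF is not guaranteed. The sketch assumes the `test_policy1` policy and the `PDF_DIR` and `L_CURR_DIR` directories from the code above already exist; the output file name `output.html` is a hypothetical placeholder.

```plsql
declare
  l_html   CLOB;
  l_bfile  BFILE := bfilename('PDF_DIR', '&2');
begin
  dbms_lob.fileopen(l_bfile, dbms_lob.file_readonly);

  -- plaintext => false requests HTML output from AUTO_FILTER
  -- instead of plain text (assumption: policy test_policy1 exists).
  ctx_doc.policy_filter(
      policy_name => 'test_policy1'
    , document    => l_bfile
    , restab      => l_html
    , plaintext   => false
  );

  dbms_lob.fileclose(l_bfile);

  -- 'output.html' is a hypothetical file name used for illustration.
  dbms_xslprocessor.clob2file(l_html, 'L_CURR_DIR', 'output.html');
end;
/
```

Comparing the HTML output with the plain-text output for the same file would show whether "Name:" and "Amount" still end up on separate lines.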
