search - How to search PDF documents / a PDX catalog in PowerShell

Question

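I have a vendor who supplies their documentation library as a series of PDF files (and some CHM files), and who also includes a .PDX catalog.

I want to write a PowerShell script to front-end it (using PowerShell forms, or hosting PowerShell in ASP.NET).

I'm at an early stage. I have worked out how to get the document info from a PDF stream (the xmpmeta XML metadata block near the end of the PDF file, one of the few plain-text streams in the file), which looks like this: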

    <x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 4.2.1-c043 52.372728, 2009/01/18-15:08:04 
       "><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><rdf:Description rdf:about="
" xmlns:pdf="http://ns.adobe.com/pdf/1.3/"><pdf:Producer>GPL Ghostscript 8.64</pdf:Producer><pdf:Keywo
rds>86000056-413</pdf:Keywords></rdf:Description><rdf:Description rdf:about="" xmlns:xmp="http://ns.ad
obe.com/xap/1.0/"><xmp:ModifyDate>2011-03-03T17:38:34-05:00</xmp:ModifyDate><xmp:CreateDate>2011-01-28
T23:12:07+05:30</xmp:CreateDate><xmp:CreatorTool>PScript5.dll Version 5.2</xmp:CreatorTool><xmp:Metada
taDate>2011-03-03T17:38:34-05:00</xmp:MetadataDate></rdf:Description><rdf:Description rdf:about="" xml
ns:xmpMM="http://ns.adobe.com/xap/1.0/mm/"><xmpMM:DocumentID>6cb2263d-2d61-11e0-0000-1390d57dcfcb</xmp
MM:DocumentID><xmpMM:InstanceID>uuid:1a0e68ba-14ad-4a03-b7a1-0a0e127b8753</xmpMM:InstanceID></rdf:Desc
ription><rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"><dc:format>applicati
on/pdf</dc:format><dc:title><rdf:Alt><rdf:li xml:lang="x-default">I/O Subsystem Programming Guide</rdf
:li></rdf:Alt></dc:title><dc:creator><rdf:Seq><rdf:li>Unisys Information Development</rdf:li></rdf:Seq
></dc:creator><dc:description><rdf:Alt><rdf:li xml:lang="x-default">ClearPath MCP 13.1,Application Dev
elopment,Administration,ClearPath MCP</rdf:li></rdf:Alt></dc:description></rdf:Description></rdf:RDF><
/x:xmpmeta>

I read it with the following code (PowerShell v3; in v2 you need to select and expand the properties instead: [string]$title = ($rdf.GetElementsByTagName('dc:title')| Select -expand Alt|Select -expand li)."#text"):

    $file = ".\Downloads\68698703-007\PDF\86000056-413.pdf"

    # Find the line on which the xmpmeta block starts (assumes a single match in the file)
    [int]$startln = (Select-String -Pattern '^<x:' -Path $file).LineNumber

    # Find the line on which the xmpmeta block ends
    [int]$endln = (Select-String -Pattern '^</x:' -Path $file).LineNumber

    # LineNumber is 1-based, array indexes are 0-based
    $startln--
    $endln--

    # Grab the xmpmeta block and cast it to XML
    [xml]$xmp = (Get-Content $file)[$startln..$endln]
    [xml]$rdf = $xmp.xmpmeta.InnerXml

    # Get the title/creator/description element text
    # dc:creator wraps its items in rdf:Seq, and an attribute-free li comes back as a plain string
    [string]$title = $rdf.GetElementsByTagName('dc:title').Alt.li."#text"
    [string]$creator = $rdf.GetElementsByTagName('dc:creator').Seq.li
    [string]$description = $rdf.GetElementsByTagName('dc:description').Alt.li."#text"

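This is crucial because the file names have the format 12345678-123.pdf, while the real title is in the metadata itself, along with the document category and so on.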

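So I can generate a document list (showing the proper titles rather than the real file names) and allow the documents to be launched, but I would also like to be able to search all the documents using the PDX file, and that is by no means plain text!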
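A minimal sketch of such a document list, assuming every PDF carries an uncompressed XMP packet with a dc:title element. Get-PdfTitle and Get-PdfCatalogue are hypothetical names, not part of any Adobe or PowerShell API, and the regex is a shortcut rather than a real XML parse.

```powershell
# Hypothetical helper: pull the first dc:title text out of a PDF's raw bytes.
function Get-PdfTitle {
    param([Parameter(Mandatory)][string]$Path)

    # .NET does not follow PowerShell's current location, so resolve to a full path first.
    $full   = (Resolve-Path -LiteralPath $Path).ProviderPath
    # Latin-1 maps every byte to exactly one character, so binary streams cannot break the decode.
    $latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
    $text   = $latin1.GetString([System.IO.File]::ReadAllBytes($full))

    # (?s) lets '.' span line breaks; the lazy match stops at the first closing tag.
    $m = [regex]::Match($text, '(?s)<dc:title>.*?<rdf:li[^>]*>(.*?)</rdf:li>')
    if ($m.Success) {
        # Assumption: the packet is UTF-8, so undo the Latin-1 decode for the captured text.
        [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($m.Groups[1].Value)).Trim()
    } else {
        $null
    }
}

# Hypothetical helper: one object per PDF, falling back to the file name when no title is found.
function Get-PdfCatalogue {
    param([Parameter(Mandatory)][string]$Folder)

    Get-ChildItem -Path $Folder -Filter *.pdf | ForEach-Object {
        $title = Get-PdfTitle -Path $_.FullName
        if (-not $title) { $title = $_.BaseName }
        [pscustomobject]@{ Title = $title; File = $_.FullName }
    }
}
```

Something like Get-PdfCatalogue -Folder 'P:\Doc Library\PDF' | Sort-Object Title would then give the list to display, and Invoke-Item on the File property opens the chosen document in whatever viewer is registered for PDFs.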

I figure I could use one of the many tools to convert each PDF to text, search it, repeat for every document, and then return the results per document.
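The convert-then-search idea could be sketched as follows. Search-PdfText is a hypothetical name, and the default extractor assumes the pdftotext command-line tool (from Xpdf or Poppler) is on the PATH; any other converter can be swapped in through the -Extractor script block.

```powershell
# Hypothetical helper: search the extracted text of every PDF in a folder.
# $Extractor is any script block that takes a PDF path and emits its text, one line per string.
function Search-PdfText {
    param(
        [Parameter(Mandatory)][string]$Folder,
        [Parameter(Mandatory)][string]$Term,
        # Assumption: pdftotext is installed; '-' as the output name sends the text to stdout.
        [scriptblock]$Extractor = { param($pdf) & pdftotext $pdf - }
    )

    foreach ($pdf in Get-ChildItem -Path $Folder -Filter *.pdf) {
        # -SimpleMatch treats the term as literal text rather than a regular expression.
        $hits = @(& $Extractor $pdf.FullName | Select-String -Pattern $Term -SimpleMatch)
        if ($hits.Count -gt 0) {
            [pscustomobject]@{
                File  = $pdf.FullName
                Count = $hits.Count
                First = $hits[0].Line.Trim()
            }
        }
    }
}
```

This re-extracts every document on every search, so it will be slow on a large library; caching the extracted text per file would be the obvious next step.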

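However, it struck me that Adobe Reader already does this. Could I launch AcroRd32.exe with a switch that starts a search for a term I pass to it, or could I use Adobe's Search API from PowerShell?

Any ideas on auto-loading a .PDX in Adobe Reader and starting a search, or on using Adobe's API in PowerShell?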

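Edit:
I can now launch Acrobat from the command line and run a search (so it can be emulated in PowerShell), but the search only works when the target is a PDF, not the PDX catalog. Both open the search pane, but only with a PDF document is the search field populated and the search executed.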

    C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\00_home.pdx"

or

    C:\Program Files (x86)\Adobe\Reader 10.0\Reader>AcroRd32.exe /A "search=trim" "P:\Doc Library\PDF\86000056-413.pdf"
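For reference, the same launch can be wrapped in PowerShell roughly like this. Get-ReaderSearchArguments and Start-ReaderSearch are hypothetical names, and the default Reader path is simply the one from the prompt above. The sketch only reproduces the command line; it does nothing about the search term being ignored for the .pdx catalog.

```powershell
# Hypothetical helper: build the argument string used in the command lines above.
function Get-ReaderSearchArguments {
    param([string]$Term, [string]$Document)

    # Both values are quoted because the term and the path may contain spaces.
    '/A "search={0}" "{1}"' -f $Term, $Document
}

# Hypothetical helper: launch Reader with that argument string.
function Start-ReaderSearch {
    param(
        [string]$Term,
        [string]$Document,
        # Assumption: Reader 10 in its default 32-bit location, as in the prompt above.
        [string]$Reader = 'C:\Program Files (x86)\Adobe\Reader 10.0\Reader\AcroRd32.exe'
    )

    $arguments = Get-ReaderSearchArguments -Term $Term -Document $Document
    Start-Process -FilePath $Reader -ArgumentList $arguments
}
```

Calling Start-ReaderSearch -Term 'trim' -Document 'P:\Doc Library\PDF\86000056-413.pdf' should behave like the second command line.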

Regards, Graham

score 0 · Accepted Answer

This is an old post, but be aware that the searching you do is potentially dangerous and that there is a better way to find the XMP metadata in a PDF file. XMP was designed specifically to be "findable" by text search. To that end it has well-defined begin and end markers that exist precisely so that you can extract the XMP data without having to parse the PDF format (or any other format the XMP metadata blob might be embedded in).

You can download the XMP specification here: http://www.adobe.com/devnet/xmp.html. Part 1 covers XMP packets and explains how a text scanner can locate a packet more accurately.
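As a rough illustration of packet scanning: an XMP packet is wrapped in a header processing instruction of the form <?xpacket begin="..." id="W5M0MpCehiHzreSzNTczkc9d"?> and a trailer of the form <?xpacket end="w"?> (or "r"). The sketch below looks for that wrapper instead of for lines starting with <x:. Get-XmpPacket is a hypothetical name, and the code assumes the metadata stream is stored uncompressed and encoded as UTF-8.

```powershell
# Hypothetical helper: return every XMP packet body found in a file, with its offset.
function Get-XmpPacket {
    param([Parameter(Mandatory)][string]$Path)

    # .NET does not follow PowerShell's current location, so resolve to a full path first.
    $full   = (Resolve-Path -LiteralPath $Path).ProviderPath
    # Latin-1 maps every byte to exactly one character, so binary data survives the decode.
    $latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')
    $text   = $latin1.GetString([System.IO.File]::ReadAllBytes($full))

    # Header PI, lazily matched body, trailer PI; (?s) lets '.' span line breaks.
    $pattern = '(?s)<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>'

    foreach ($m in [regex]::Matches($text, $pattern)) {
        # Assumption: the packet is UTF-8, so undo the Latin-1 decode for the body.
        $xml = [System.Text.Encoding]::UTF8.GetString($latin1.GetBytes($m.Groups[1].Value))
        # Trim removes the whitespace padding that packets usually carry before the trailer.
        [pscustomobject]@{ Offset = $m.Index; Xml = $xml.Trim() }
    }
}
```

Each Xml value can then be cast with [xml] and queried with GetElementsByTagName, as in the question.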

Finally, PDF has an additional quirk: it can be incrementally updated. This might cause multiple XMP packets to appear in the file (where the last packet is normally the correct one). Annoyingly, when the PDF is exported from applications like InDesign, images (and other objects) in the PDF might also have their own "object" XMP attached.
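A small sketch of the "take the last packet" rule, with a warning when more than one packet is present. Select-LastXmpPacket is a hypothetical name, and the rule is only a heuristic: it cannot tell document-level metadata from object-level metadata, which would require following the /Metadata entry of the document catalog with a real PDF parser.

```powershell
# Hypothetical helper: return the body of the last XMP packet in already-decoded PDF text.
function Select-LastXmpPacket {
    param([Parameter(Mandatory)][string]$RawText)

    # Header PI, lazily matched body, trailer PI; (?s) lets '.' span line breaks.
    $pattern = '(?s)<\?xpacket\s+begin=.*?\?>(.*?)<\?xpacket\s+end=.*?\?>'
    $found   = [regex]::Matches($RawText, $pattern)

    if ($found.Count -eq 0) { return $null }
    if ($found.Count -gt 1) {
        # More than one packet usually means incremental updates or per-object metadata.
        Write-Warning ('{0} XMP packets found; using the last one.' -f $found.Count)
    }
    $found[$found.Count - 1].Groups[1].Value.Trim()
}
```

The warning is there so that files with several packets can be inspected by hand before the heuristic is trusted.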

So consider where your files come from, how many strange things you might encounter, and which of them you want to provide for. Either way, reading the XMP specification is not a bad idea.
