php - 如何从pdf文件中获取文本并将其保存到数据库中

Question

可能重复：
php：pdf到字符串

我正在尝试将 pdf 文件的文本内容保存到数据库中。我发现这个链接对Converting PDF to string很有帮助，并着手处理它。但它只转换非常少的数据:(为什么会这样？

或者关于如何将复杂的pdf文件（包含页眉、页脚、表格、某些页面中的两列布局等）转换为字符串并将其保存到数据库的任何其他解决方案？

score 4 · Accepted Answer

很久以前，我写了一个脚本，可以下载 pdf 并将其转换为文本。此函数进行转换：

function pdf2string($sourcefile) {

$content = $sourcefile;

$searchstart = 'stream';
$searchend = 'endstream';
$pdfText = '';
$pos = 0;
$pos2 = 0;
$startpos = 0;

while ($pos !== false && $pos2 !== false) {

$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);

if ($pos !== false && $pos2 !== false){

if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
$pos += 2;
} else if ($content[$pos] == 0x0a) {
$pos++;
}

if ($content[$pos2 - 2] == 0x0d && $content[$pos2 - 1] == 0x0a) {
$pos2 -= 2;
} else if ($content[$pos2 - 1] == 0x0a) {
$pos2--;
}

$textsection = substr(
$content,
$pos + strlen($searchstart) + 2,
$pos2 - $pos - strlen($searchstart) - 1
);
$data = gzuncompress($textsection);
$pdfText .= pdfExtractText($data);
$startpos = $pos2 + strlen($searchend) - 1;

}
}

return preg_replace('/(\s)+/', ' ', $pdfText);

}

编辑：我称pdfExtractText()这个函数在这里定义：

function pdfExtractText($psData){

if (!is_string($psData)) {
return '';
}

$text = '';

// Handle brackets in the text stream that could be mistaken for
// the end of a text field. I'm sure you can do this as part of the
// regular expression, but my skills aren't good enough yet.
$psData = str_replace('\)', '##ENDBRACKET##', $psData);
$psData = str_replace('\]', '##ENDSBRACKET##', $psData);

preg_match_all(
'/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
$psData,
$matches
);
for ($i = 0; $i < sizeof($matches[0]); $i++) {
if ($matches[3][$i] != '') {
// Run another match over the contents.
preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
foreach ($subMatches[1] as $subMatch) {
$text .= $subMatch;
}
} else if ($matches[4][$i] != '') {
$text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
}
}

// Translate special characters and put back brackets.
$trans = array(
'...' => '…',
'\205' => '…',
'\221' => chr(145),
'\222' => chr(146),
'\223' => chr(147),
'\224' => chr(148),
'\226' => '-',
'\267' => '•',
'\374'  => 'ü',
'\344'  => 'ä',
'\247'  => '§',
'\366'  => 'ö',
'\337'  => 'ß',
'\334'  => 'Ü',
'\326'  => 'Ö',
'\304'  => 'Ä',
'\(' => '(',
'\[' => '[',
'##ENDBRACKET##' => ')',
'##ENDSBRACKET##' => ']',
chr(133) => '-',
chr(141) => chr(147),
chr(142) => chr(148),
chr(143) => chr(145),
chr(144) => chr(146),
);
$text = strtr($text, $trans);

return $text;
}

EDIT2：要从本地文件中获取内容，请使用：

$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);

EDIT3：在将数据保存到数据库之前，我使用了转义函数：

function escape($str)
{
$search=array("\\","\0","\n","\r","\x1a","'",'"');
$replace=array("\\\\","\\0","\\n","\\r","\Z","\'",'\"');
return str_replace($search,$replace,$str);
}

php - 如何从pdf文件中获取文本并将其保存到数据库中

1 回答 1

Related

Reference