I am generating text from a PDF file with the help of pdftotext. My problem is not with pdftotext itself, but with formatting the resulting text appropriately.

Salman              Madhuri             Mohnish             Renuka                Anupam
Khan                Dixit               Behl                Shahane               Kher
Prem                Nisha Chou...       Rajesh              Pooja Chou...         Prof. Siddh


Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...

I need the output to be:

Salman Khan Prem
Madhuri Dixit Nisha Chou...
Mohnish Behl Rajesh
Renuka Shahane Pooja Chou...
Anupam Kher Prof.

Hum Aapke Hain Koun...! (1994) - Full cast and crew
www.imdb.com/title/tt0110076/fullcredits
Hum Aapke Hain Koun...! on IMDb: Movies, TV, Celebs, and more... ... IMDbPro.com
offers representation listings for over 120,000 individuals, including actors, ...

1 Answer

I'm not sure what your delimiters are, but you could do something like the following (it's a bit ugly, but it gets the job done):

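// Split on blank lines: element 0 is the block of names, element 1 the text that follows.
$namesAndContent = explode("\r\n\r\n", $theString);
$nameRows = explode("\r\n", $namesAndContent[0]);
$names = array();
foreach ($nameRows as $row) {
    // Columns are separated by runs of two or more whitespace characters.
    $items = preg_split('/\s{2,}/', $row);
    foreach ($items as $index => $namePart) {
        if (!array_key_exists($index, $names)) {
            $names[$index] = array();
        }
        // Collect the parts column by column, so each entry ends up holding one person.
        $names[$index][] = $namePart;
    }
}

// Print one line per column, then the rest of the text.
foreach ($names as $name) {
    echo implode(' ', $name) . "\r\n";
}
echo "\r\n";
echo $namesAndContent[1];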

Demo: http://codepad.viper-7.com/Nr1Q4t

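The above will format the data (when the delimiters are correct), but I'm wondering where the data originally comes from (the source, rather than the PDF), because I suspect there is a better way to solve your problem. Perhaps there is an API you can use directly.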
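
The caveat about delimiters matters because the snippet splits on Windows-style `\r\n` line endings only. As a minimal sketch of the same column-joining idea that normalizes line endings first, something like the following might work. `formatCastBlock` is a hypothetical helper name, and the assumption that columns are separated by two or more spaces comes from the sample text in the question, not from any documented pdftotext behaviour.

```php
<?php
// Sketch only: formatCastBlock() is a hypothetical helper, not part of the original answer.
function formatCastBlock(string $text): string
{
    // Normalize \r\n and \r to \n so the input's line-ending style does not matter.
    $text = str_replace(["\r\n", "\r"], "\n", $text);

    // Split once, at the first run of blank lines: names block first, the rest second.
    $parts = preg_split('/\n[ \t]*\n\s*/', trim($text), 2);

    $columns = [];
    foreach (explode("\n", $parts[0]) as $row) {
        // Assumption: two or more whitespace characters separate the columns.
        $cells = preg_split('/\s{2,}/', trim($row));
        foreach ($cells as $index => $cell) {
            // Group the cells by column position, one group per person.
            $columns[$index][] = $cell;
        }
    }

    // Join each column's cells into a single line.
    $lines = [];
    foreach ($columns as $column) {
        $lines[] = implode(' ', $column);
    }

    $out = implode("\n", $lines);
    if (isset($parts[1])) {
        // Re-attach the remaining text after one blank line.
        $out .= "\n\n" . $parts[1];
    }
    return $out;
}
```

Calling `echo formatCastBlock($theString);` on the sample from the question should print one person per line, followed by a blank line and the remaining text. The cells are kept whole, so the last person comes out as `Anupam Kher Prof. Siddh`. Like the original snippet, this misaligns the columns if a cell is empty or contains a double space.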

Answered 2012-12-19T11:08:52.793