
I have a program that copies a Word file (doc/docx) as follows:

The source file, which is either doc or docx, is first copied to a temporary raw file, and the extension is lost in the process. The contents of this temporary file then have to be copied to a file with the correct extension (doc/docx). Since nothing is known about the original file at that point, the extension of the source Word document has to be derived from its contents.

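    // Copy src to dst byte for byte (needs java.io.*; src and dst are File or path String values).
    try (InputStream in = new FileInputStream(src);
         OutputStream out = new FileOutputStream(dst)) {
        byte[] buf = new byte[1024];
        int len;
        while ((len = in.read(buf)) != -1) {  // read() returns -1 at end of stream
            out.write(buf, 0, len);
        }
    }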

The destination dst is a raw file without any extension (say, 'sample-file'), which I can't change. The source src may be of type 'doc' or 'docx'.
As output, I need to copy the contents of dst to a Word document in the same format as src (this matters, because with the wrong format the document is rendered useless). Since dst has no extension, I cannot determine the file format from the name. Is there a way to derive the file extension from the file contents? Presumably a Word document carries some metadata containing this information.


2 Answers


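http://www.forensicswiki.org/wiki/Word_Document_%28DOC%29 This link goes into detail about many different file types. It describes the headers of DOC and DOCX files, so you should be able to parse the file and determine its type.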

According to the link, a .doc file is an OLE compound file, which should start with the following binary header:

d0 cf 11 e0 a1 b1 1a e1

A .docx file, by contrast, will have the binary signature:

50 4b

This is because a DOCX file is in ZIP format, where the first two bytes are the letters PK (named after Phil Katz, the creator of ZIP).
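The check on those two signatures could be sketched roughly as follows. This is a minimal illustration and not a complete detector: the class name `WordFormatSniffer` and the method `guessExtension` are hypothetical, and the sketch assumes that a file starting with PK is a Word document. In general PK only identifies a ZIP container, so other ZIP-based files (.xlsx, .pptx, .jar) would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper; the class and method names are illustrative only.
class WordFormatSniffer {
    // OLE compound file signature used by legacy .doc files.
    private static final byte[] OLE_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // "PK": the first two bytes of a ZIP archive, which includes .docx.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B};

    /** Returns "doc", "docx", or null when neither signature matches. */
    static String guessExtension(byte[] header) {
        if (startsWith(header, OLE_MAGIC)) return "doc";
        // Assumption: a ZIP container here is a Word document.
        if (startsWith(header, ZIP_MAGIC)) return "docx";
        return null;
    }

    /** Reads up to the first 8 bytes of the file and classifies them. */
    static String guessExtension(Path file) throws IOException {
        byte[] header = new byte[OLE_MAGIC.length];
        try (InputStream in = Files.newInputStream(file)) {
            // readNBytes(byte[], int, int) requires Java 9 or later.
            int n = in.readNBytes(header, 0, header.length);
            return guessExtension(Arrays.copyOf(header, n));
        }
    }

    // True when data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

For the extension-less file from the question, a call such as `WordFormatSniffer.guessExtension(Paths.get("sample-file"))` would then yield the extension to use for the final copy, or null if the file is neither format.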

Hope this helps!

Answered 2013-09-11T06:45:59.797

If you read the contents of a DOCX file in binary form, the first two characters will be "PK". You can use that to identify whether it is a DOCX file.

Answered 2013-09-11T06:46:52.470