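I'm writing a program in Java that reads a file's input stream, encrypts it by altering the byte values according to a password, and then creates a new encrypted file.

For example:
I created a test file containing the following text: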
This is a test to see if the encrypter project works.
When I read the bytes in Java, I get:
[84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 116, 101, 115, 116, 32, 116, 111, 32, 115, 101, 101, 32, 105, 102, 32, 116, 104, 101, 32, 101, 110, 99, 114, 121, 112, 116, 101, 114, 32, 112, 114, 111, 106, 101, 99, 116, 32, 119, 111, 114, 107, 115, 46, 10]
I then take the value of each byte, subtract the Unicode value of the password, and take the absolute value. Then I write the result to a file.
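A minimal sketch of the step described above, assuming the password is applied one character at a time and cycled over the data (the description does not say how it is applied); the class and method names are hypothetical. Because of the absolute value, two different input bytes can produce the same output byte, so this step cannot be reversed in general.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AbsDiffSketch {
    // Hypothetical helper: |byte value - password character|, cycling through the password.
    static byte[] transform(byte[] data, String password) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            int key = password.charAt(i % password.length()); // assumption: one char per byte
            out[i] = (byte) Math.abs((data[i] & 0xFF) - key);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] plain = "This".getBytes(StandardCharsets.US_ASCII); // 84, 104, 105, 115
        System.out.println(Arrays.toString(transform(plain, "a"))); // [13, 7, 8, 18]

        // 'Z' (90) and 'h' (104) are both 7 away from 'a' (97), so they collide.
        System.out.println(transform(new byte[] {90}, "a")[0]);  // 7
        System.out.println(transform(new byte[] {104}, "a")[0]); // 7
    }
}
```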

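I'm trying different algorithms to encrypt it and started testing them on a test text file. I'm on Linux, so the file has no extension (e.g. .txt, .pdf, etc.). I noticed that after encrypting it a few times, the computer no longer recognized it as a text file but as an image file! (That is, when you click on it, it tries by default to open the file in an image editor.)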

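So here are my questions:

1. What causes a computer to recognize a file as a certain file type?

  • I'm guessing it has to do with certain bytes it looks at somewhere in the file, but beyond that I'm lost.

2. Where is this information stored in the file?

  • I'd like the file to keep the same file type even after encryption, so I was thinking that if, for example, the file type information is in the first 10 bytes, I would leave the first 10 bytes alone and encrypt everything after them.

3. Is the file type information standard?

  • Do these bytes have a meaning that is standard across all platforms? (I.e., a PDF file is a PDF file no matter which computer you use it on. Is that because of the .pdf extension, or because of the file's contents?)

4. Assuming the file type is recognized from bytes in the file, how can I change the file type?

  • Where can I find a list of which bytes in a file mean what?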
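The idea in question 2 (leave the first N bytes alone, transform the rest) could be sketched roughly as follows. The header length is an assumption, and a repeating-key XOR stands in for the real cipher only because applying it twice restores the input; it is not the subtract-and-absolute-value step described earlier. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class HeaderPreservingXor {
    // Copies data, leaves the first headerLen bytes untouched,
    // and XORs every later byte with a repeating key (key must be non-empty).
    static byte[] xorAfterHeader(byte[] data, byte[] key, int headerLen) {
        byte[] out = data.clone();
        for (int i = Math.min(headerLen, out.length); i < out.length; i++) {
            out[i] ^= key[(i - headerLen) % key.length];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = "GIF89a-pretend-image-data".getBytes(StandardCharsets.US_ASCII);
        byte[] key = "secret".getBytes(StandardCharsets.US_ASCII);

        byte[] scrambled = xorAfterHeader(data, key, 6);     // keep the 6-byte "GIF89a" prefix
        byte[] restored = xorAfterHeader(scrambled, key, 6); // XOR twice gives the original back

        System.out.println(new String(scrambled, 0, 6, StandardCharsets.US_ASCII)); // GIF89a
        System.out.println(new String(restored, StandardCharsets.US_ASCII));
    }
}
```

This only helps for formats that actually have a header; a plain text file like the test file has none.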

2 Answers

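On traditional UNIX systems, files are identified solely by looking for specific byte patterns that appear in the file.

The file command uses a magic configuration file (usually /etc/magic or /usr/share/file/magic) containing the rules that define these byte patterns.

That's it: there is no special extra metadata. It is all done by analysing the content.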
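A rough Java sketch of the same idea, checking a few well-known signatures at the start of a byte array. This is not how the file command is implemented (its magic rules are far richer); the class name, method names and returned labels are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class MagicSniffer {
    // True if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    // Returns a hypothetical label based on the leading bytes only.
    static String sniff(byte[] head) {
        if (startsWith(head, ascii("GIF87a")) || startsWith(head, ascii("GIF89a"))) return "GIF image";
        if (startsWith(head, ascii("%PDF-"))) return "PDF document";
        if (startsWith(head, new byte[] {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A})) return "PNG image";
        if (startsWith(head, new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})) return "JPEG image";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(sniff(ascii("GIF89a....")));     // GIF image
        System.out.println(sniff(ascii("%PDF-1.4")));       // PDF document
        System.out.println(sniff(ascii("This is a test"))); // unknown
    }
}
```

Scrambled bytes can match one of these patterns by accident, which is one way a text file could end up being reported as an image after encryption.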

answered 2012-04-12T21:07:51.137

Usually it will be within the first few bytes of the file.

From Wikipedia:

Internal Metadata
A second way to identify a file format is to store information regarding the format inside the file itself. Usually, such information is written in one (or more) binary string(s), tagged or raw texts placed in fixed, specific locations within the file. Since the easiest place to locate them is at the beginning of it, such area is usually called a file header when it is greater than a few bytes, or a magic number if it is just a few bytes long.

However, the file type is not necessarily stored in the first few bytes; it can be stored elsewhere:

The metadata contained in a file header are not necessarily stored only at the beginning but might be present in other areas too, often including the end of the file; it depends on the file format or the type of data it contains. Character-based (text) files have character-based human-readable headers, whereas binary formats usually feature binary headers, although that is not a rule: a human-readable file header may require more bytes, but is easily discernible with simple text or hexadecimal editors. File headers may not only contain the information required by algorithms to identify the file format alone, but also real metadata about the file and its contents. For example most image file formats store information about image size, resolution, color space/format and optionally other authoring information like who, when and where it was made, what camera model and shooting parameters was it taken with (if any, cfr. Exif), and so on. Such metadata may be used by a program reading or interpreting the file both during the loading process and after that, but can also be used by the operating system to quickly capture information about the file itself without loading it all into memory.

Another method of storing the file type inside the file is using magic numbers:

One way to incorporate such metadata, often associated with Unix and its derivatives, is just to store a "magic number" inside the file itself. Originally, this term was used for a specific set of 2-byte identifiers at the beginning of a file, but since any undecoded binary sequence can be regarded as a number, any feature of a file format which uniquely distinguishes it can be used for identification. GIF images, for instance, always begin with the ASCII representation of either GIF87a or GIF89a, depending upon the standard to which they adhere. Many file types, most especially plain-text files, are harder to spot by this method. HTML files, for example, might begin with the string <html> (which is not case sensitive), or an appropriate document type definition that starts with <!DOCTYPE


The file type doesn't even have to be stored inside the file. Other methods include filename extensions or even external metadata:

A final way of storing the format of a file is to explicitly store information about the format in the file system, rather than within the file itself. This approach keeps the metadata separate from both the main data and the name, but is also less portable than either file extensions or "magic numbers", since the format has to be converted from filesystem to filesystem. While this is also true to an extent with filename extensions — for instance, for compatibility with MS-DOS's three character limit — most forms of storage have a roughly equivalent definition of a file's data and name, but may have varying or no representation of further metadata.

There are many other ways too, but these tend to be the most common.

answered 2012-04-23T04:11:53.733