php - Detecting the EOL type using PHP

Question

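For reference: this is a self-answered question. It is meant to share knowledge, Q&A style.

PS: I have written this code from scratch too many times, so I decided to share it on SO. Besides, I am sure someone will find ways to improve it.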

score 9 · Accepted Answer

/**
 * Detects the end-of-line character of a string.
 * @param string $str The string to check.
 * @param string $default Default EOL (if not detected).
 * @return string The detected EOL, or default one.
 */
function detectEol($str, $default=''){
    // PHP has no "\0x.." escape: "\0x0D0A" is a NUL byte followed by the literal text "x0D0A".
    // Bytes are written "\xHH"; "\u{...}" (PHP 7+) yields the UTF-8 encoding of a code point.
    // Two-byte sequences come first so they win ties against their single-byte parts.
    static $eols = array(
        "\x0D\x0A",   // [ASCII] CR+LF: Windows, TOPS-10, RT-11, CP/M, MP/M, DOS, Atari TOS, OS/2, Symbian OS, Palm OS
        "\x0A\x0D",   // [ASCII] LF+CR: BBC Acorn, RISC OS spooled text output.
        "\x0A",       // [ASCII] LF: Multics, Unix, Unix-like, BeOS, Amiga, RISC OS
        "\x0D",       // [ASCII] CR: Commodore 8-bit, BBC Acorn, TRS-80, Apple II, Mac OS <=v9, OS-9
        "\x0B",       // [ASCII] VT: Vertical Tab
        "\x0C",       // [ASCII] FF: Form Feed
        "\x1E",       // [ASCII] RS: QNX (pre-POSIX)
        //"\x76",     // [?????] NEWLINE: ZX80, ZX81 [DEPRECATED]
        "\x15",       // [EBCDIC] NEL: OS/390, OS/400
        "\u{0085}",   // [UNICODE] NEL: Next Line, U+0085 (UTF-8 input assumed)
        "\u{2028}",   // [UNICODE] LS: Line Separator, U+2028 (UTF-8 input assumed)
        "\u{2029}",   // [UNICODE] PS: Paragraph Separator, U+2029 (UTF-8 input assumed)
    );
    $cur_cnt = 0;
    $cur_eol = $default;
    foreach($eols as $eol){
        if(($count = substr_count($str, $eol)) > $cur_cnt){
            $cur_cnt = $count;
            $cur_eol = $eol;
        }
    }
    return $cur_eol;
}

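Notes:

The encoding type needs to be checked.
~~Need to somehow know that we may be on an exotic system like ZX8x (since ASCII x76 is a regular letter)~~ @radu raised a good point: in my case, it is not worth the effort to handle ZX8x systems properly.
Should I split the function in two? mb_detect_eol() (multibyte) and detect_eol()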
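
On the first note, here is a minimal sketch of one way to take the encoding into account: look for a byte-order mark, convert UTF-16 input to UTF-8, and only then count the candidates. The function name detectEolWithBom(), the shortened candidate list, and the BOM-only heuristic are assumptions for illustration, not part of the answer above. Input without a BOM is treated as single-byte or UTF-8 text, UTF-32 is not handled, and the mbstring extension is required.

```php
<?php
/**
 * Hypothetical helper: normalises BOM-marked UTF-16 input to UTF-8,
 * then returns the most frequent EOL candidate (or $default).
 */
function detectEolWithBom(string $str, string $default = ''): string
{
    // Assumption: only BOM-marked UTF-16 is converted; everything else is used as-is.
    if (strncmp($str, "\xFF\xFE", 2) === 0) {
        $str = mb_convert_encoding(substr($str, 2), 'UTF-8', 'UTF-16LE');
    } elseif (strncmp($str, "\xFE\xFF", 2) === 0) {
        $str = mb_convert_encoding(substr($str, 2), 'UTF-8', 'UTF-16BE');
    }

    // Two-byte sequences come first so they win ties against their single-byte parts.
    $candidates = ["\r\n", "\n\r", "\n", "\r", "\u{2028}", "\u{2029}"];

    $best = $default;
    $bestCount = 0;
    foreach ($candidates as $eol) {
        $count = substr_count($str, $eol);
        if ($count > $bestCount) {
            $bestCount = $count;
            $best = $eol;
        }
    }
    return $best;
}

// UTF-16LE text "a\r\nb\r\n" preceded by its BOM.
$utf16 = "\xFF\xFE" . mb_convert_encoding("a\r\nb\r\n", 'UTF-16LE', 'UTF-8');
var_dump(detectEolWithBom($utf16) === "\r\n"); // bool(true)
```

Converting first keeps the counting loop byte-oriented, which is one possible answer to the question of whether a separate multibyte function is needed.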

score 7 · Accepted Answer

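Wouldn't it be easier to just replace everything except new lines using regex?

"The dot matches a single character, without caring what that character is. The only exception are newline characters."

With that in mind, we do some magic: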

$string = 'some string with new lines';
$newlines = preg_replace('/.*/', '', $string);
// $newlines is now filled with new lines, we only need one
$newline = substr($newlines, 0, 1);

Not sure if we can trust regex to do all this, but I don't have anything to test with.


score 4 · Accepted Answer

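The answers already given here provide users with enough information. The following code (based on those answers) may help further:

It provides a reference for the EOL that was found.

The detection also sets a key that an application can use to refer to that reference.

It shows how to use the reference in a utility class.

It shows how to use it to detect the EOL of a file, returning the key name of the EOL found.

I hope this is of use to all of you.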

/**
Newline characters in different Operating Systems
The names given to the different sequences are:
============================================================================================
NewL  Chars       Name     Description
----- ----------- -------- ------------------------------------------------------------------
LF    0x0A        UNIX     Apple OSX, UNIX, Linux
CR    0x0D        TRS80    Commodore, Acorn BBC, ZX Spectrum, TRS-80, Apple II family, etc
LFCR  0x0A 0x0D   ACORN    Acorn BBC and RISC OS spooled text output.
CRLF  0x0D 0x0A   WINDOWS  Microsoft Windows, DEC TOPS-10, RT-11 and most other early non-Unix
                          and non-IBM OSes, CP/M, MP/M, DOS (MS-DOS, PC DOS, etc.), OS/2,
----- ----------- -------- ------------------------------------------------------------------
*/
const EOL_UNIX    = 'lf';        // Code: \n
const EOL_TRS80   = 'cr';        // Code: \r
const EOL_ACORN   = 'lfcr';      // Code: \n \r
const EOL_WINDOWS = 'crlf';      // Code: \r \n

Then use the following code in a static utility class to detect the EOL:

/**
Detects the end-of-line character of a string.
@param string $str      The string to check.
@param string $key      [io] Name of the detected eol key.
@return string The detected EOL, or an empty string if none was detected.
*/
public static function detectEOL($str, &$key) {
   static $eols = array(
     Util::EOL_ACORN   => "\n\r",  // 0x0A - 0x0D - acorn BBC
     Util::EOL_WINDOWS => "\r\n",  // 0x0D - 0x0A - Windows, DOS OS/2
     Util::EOL_UNIX    => "\n",    // 0x0A -      - Unix, OSX
     Util::EOL_TRS80   => "\r",    // 0x0D -      - Apple ][, TRS80
  );

  $key = "";
  $curCount = 0;
  $curEol = '';
  foreach($eols as $k => $eol) {
     if( ($count = substr_count($str, $eol)) > $curCount) {
        $curCount = $count;
        $curEol = $eol;
        $key = $k;
     }
  }
  return $curEol;
}  // detectEOL

Then for a file:

/**
Detects the EOL of a file by checking the first line.
@param string  $fileName    File to be tested (full pathname).
@return boolean false | Used key = enum('cr', 'lf', 'crlf').
@uses detectEOL
*/
public static function detectFileEOL($fileName) {
   if (!file_exists($fileName)) {
     return false;
   }

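   // Open the file for reading
   $handle = @fopen($fileName, "r");
   if ($handle === false) {
      return false;
   }
   // Read the first line, then release the handle
   $line = fgets($handle);
   fclose($handle);
   if ($line === false) {
      return false;
   }
   $key = "";
   self::detectEOL($line, $key);

   return $key;
}  // detectFileEOL

self:: resolves to the implementing class (all members are static). To call the method from outside that class, use the class name instead, for example Util::detectEOL().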

score 3 · Accepted Answer

I am posting my own answer because I could get neither ohaal's nor transilvlad's to work:

function detect_newline_type($content) {
    $arr = array_count_values(
               explode(
                   ' ',
                   preg_replace(
                       '/[^\r\n]*(\r\n|\n|\r)/',
                       '\1 ',
                       $content
                   )
               )
           );
    arsort($arr);
    return key($arr);
}

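Explanation:

The general idea in both proposed solutions is good, but implementation details get in the way of those answers being useful.

Indeed, the point of this function is to return the kind of newline used in a file, and that newline can be either one or two characters long.

This alone makes the use of str_split() incorrect. The only way to cut the tokens correctly is to use a function that splits a string into variable-length pieces based on a delimiter, rather than into fixed-length chunks. This is where explode() comes into play.

But to give explode() useful markers, it is necessary to replace the right characters with the right match, in the right quantity. Most of the magic happens in the regular expression.

Three points need to be considered:

Using .* as suggested by ohaal will not work. While it is true that . will not match newline characters, on a system where \r is not a newline character, or not part of one, . will incorrectly match it (reminder: we are detecting newlines because they may differ from the ones on our system; otherwise there is no point).
Replacing /[^\r\n]*/ with anything will "work" to make the text vanish, but it becomes a problem as soon as we want a separator (since we remove all characters except the newlines, any character that is not a newline would be a valid separator). Hence the idea of creating a match on the newline and using a backreference to that match in the replacement.
The content may contain several newlines in a row. However, we do not want to group them in that case, since the rest of the code would see them as a different type of newline. That is why the list of newlines is stated explicitly in the match for the backreference.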
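
The three points can be checked one at a time. The sketch below uses a made-up sample string ($sample and the other variable names are assumptions for illustration) containing two CRLF endings in a row followed by one LF ending. The first check assumes PCRE's usual default, in which only \n counts as a newline for the dot.

```php
<?php
// Hypothetical sample: two CRLF endings in a row (an empty line), then one LF ending.
$sample = "one\r\n\r\ntwo\n";

// Point 1: "." only refuses to match "\n", so every "\r" is swallowed with the text
// and the CRLF endings are no longer recognisable.
$dotOnly = preg_replace('/.*/', '', $sample);
var_dump(bin2hex($dotOnly));

// Point 2: keep each newline through the backreference and append a space as separator.
$marked = preg_replace('/[^\r\n]*(\r\n|\n|\r)/', '\1 ', $sample);
$tokens = explode(' ', $marked);

// Point 3: with a grouping pattern, consecutive newlines collapse into one token
// ("\r\n\r\n"), which would then be counted as a newline type of its own.
$grouped = preg_replace('/[^\r\n]*([\r\n]+)/', '\1 ', $sample);
var_dump(bin2hex(explode(' ', $grouped)[0]));

// With the explicit alternation, each newline is its own token and CRLF wins the count.
$counts = array_count_values($tokens);
arsort($counts);
var_dump(key($counts) === "\r\n");
```

The trailing empty token produced by explode() is counted once, so it only matters for input that contains no newline at all.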

score 1 · Accepted Answer

Based on ohaal's answer.

This can return one or two characters for the EOL, such as LF or CR+LF.

  $eols = array_count_values(str_split(preg_replace("/[^\r\n]/", "", $string)));
  $eola = array_keys($eols, max($eols));
  $eol = implode("", $eola);
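
Wrapped in a function, the snippet needs a guard for input that contains no CR or LF at all, because max() cannot take an empty array (PHP 8 raises a ValueError). A minimal sketch follows; the name detectEolByCharCount() and the null return value are assumptions for illustration.

```php
<?php
// Hypothetical wrapper around the three lines above, with a guard for "no newline".
function detectEolByCharCount(string $string): ?string
{
    // Keep only CR and LF characters.
    $newlines = preg_replace("/[^\r\n]/", "", $string);
    if ($newlines === null || $newlines === '') {
        return null; // no line ending found (or the regex failed)
    }
    // Count single characters; CR and LF of a CR+LF pair are counted separately.
    $counts = array_count_values(str_split($newlines));
    // Every character tied for the highest count, in order of first appearance.
    $mostFrequent = array_keys($counts, max($counts));
    return implode("", $mostFrequent);
}

var_dump(detectEolByCharCount("a\r\nb\r\n") === "\r\n"); // bool(true)
var_dump(detectEolByCharCount("a\nb\n") === "\n");       // bool(true)
var_dump(detectEolByCharCount("no newline"));            // NULL
```

Because the characters are counted individually, a two-character result is only returned when CR and LF occur equally often.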

score 0 · Accepted Answer

If you only care about LF/CR, here is a method I wrote. There is no need to handle every possible file case that you will never see.

/**
 * @param  string  $path
 * @param  string  $format  real or human_readable
 * @return false|string
 * @author Sorin-Iulian Trimbitas
 */
public static function getLineBreak(string $path, $format = 'real')
{
    // Hopefully my idea is ok, the rest of the stuff from the internet doesn't seem to work ok in some cases
    // 1. Take the first line of the CSV
    $file = new \SplFileObject($path);
    $line = $file->getCurrentLine();
    // Do we have an empty line?
    if (mb_strlen($line) == 1) {
        // Try the next line
        $file->next();
        $line = $file->getCurrentLine();
        if (mb_strlen($line) == 1) {
            // Give up
            return false;
        }
    }
    // What do we have at its end?
    $last_char = mb_substr($line, -1);
    $penultimate_char = mb_substr($line, -2, 1);
    if ($last_char == "\n" || $last_char == "\r") {
        $real_format = $last_char;
        if ($penultimate_char == "\n" || $penultimate_char == "\r") {
            $real_format = $penultimate_char.$real_format;
        }
        if ($format == 'real') {
            return $real_format;
        }
        return str_replace(["\n", "\r"], ['LF', 'CR'], $real_format);
    }
    return false;
}

score 0 · Accepted Answer

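I don't use PHP as my main language, but I tried to keep this simple and light on memory. If something needs correcting, comments and edits are welcome.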

<?php
function eol_detect(&$str, $buffSize=1024) {
    $buff = substr($str, 0, $buffSize);
    $eol = null;

    if (strpos($buff, "\r\n") !== false)
        $eol = "\r\n";
    elseif (strpos($buff, "\n") !== false)
        $eol = "\n";
    elseif (strpos($buff, "\r") !== false)
        $eol = "\r";
    
    return $eol;
}
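
A usage sketch: read only the first bytes of a file and hand them to the detector. The function is repeated in compact form so the example runs on its own; the helper name eol_detect_file() and the temporary file are assumptions for illustration.

```php
<?php
// Compact copy of eol_detect() from above so the example is self-contained.
function eol_detect(&$str, $buffSize = 1024) {
    $buff = substr($str, 0, $buffSize);
    if (strpos($buff, "\r\n") !== false) return "\r\n";
    if (strpos($buff, "\n") !== false)   return "\n";
    if (strpos($buff, "\r") !== false)   return "\r";
    return null;
}

// Hypothetical helper: reads at most $buffSize bytes instead of the whole file.
function eol_detect_file(string $path, int $buffSize = 1024): ?string
{
    $chunk = file_get_contents($path, false, null, 0, $buffSize);
    if ($chunk === false) {
        return null; // unreadable file
    }
    return eol_detect($chunk, $buffSize);
}

// Demo with a temporary file that uses CR-only line endings.
$tmp = tempnam(sys_get_temp_dir(), 'eol');
file_put_contents($tmp, "first\rsecond\r");
var_dump(eol_detect_file($tmp) === "\r"); // bool(true)
unlink($tmp);
```

One caveat of any fixed-size buffer: if the only line ending in the chunk is a CR+LF pair split exactly at the buffer boundary, the chunk ends in a lone "\r" and is reported as CR.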
