c - dirent not working with unicode

Question

I am trying to count the files in a folder, but the readdir function skips files whose names contain Unicode characters. I am using dirent in C.

int filecount(char* path)
{
    int file_Count=0;
    DIR* dirp;
    struct dirent * entry;
    dirp = opendir(path);
    while((entry=readdir(dirp)) !=NULL)
    {
        if(entry->d_type==DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

score 4 · Accepted Answer

Testing on Mac OS X 10.9.1 Mavericks, I adapted your code into the following complete program:

#include <dirent.h>
#include <stdio.h>

static
int filecount(char *path)
{
    int file_Count = 0;
    DIR *dirp;
    struct dirent *entry;
    dirp = opendir(path);
    if (dirp == NULL)
    {
        /* opendir() failed; report why and count nothing */
        perror(path);
        return 0;
    }
    while ((entry = readdir(dirp)) != NULL)
    {
        printf("Found (%llu)(%d): %s\n", entry->d_ino, entry->d_type, entry->d_name);
        if (entry->d_type == DT_REG)
        {
            ++file_Count;
        }
    }
    closedir(dirp);
    return file_Count;
}

static void proc_dir(char *dir)
{
    printf("Processing %s:\n", dir);
    printf("File count = %d\n", filecount(dir));
}

int main(int argc, char **argv)
{
    if (argc > 1)
    {
        for (int i = 1; i < argc; i++)
            proc_dir(argv[i]);
    }
    else
        proc_dir(".");
    return 0;
}

Notably, it lists each entry as it is returned: inode, type and name. On Mac OS X, I was told the inode type is __uint64_t, aka unsigned long long, hence the use of %llu for the format; YMMV on that.

I also created a folder utf8 and created these files in it:

total 32
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 ÿ-y-umlaut
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 £
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:14 €
-rw-r--r--  1 jleffler  eng  6 Jan  7 12:15 ™

Each file contains Hello and a newline. When I run the command (which I called fc), it gives:

$ ./fc utf8
Processing utf8:
Found (8138036)(4): .
Found (377579)(4): ..
Found (8138046)(8): ÿ-y-umlaut
Found (8138067)(8): £
Found (8138054)(8): €
Found (8138078)(8): ™
File count = 4
$

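The Euro symbol € is U+20AC EURO SIGN, which is way outside the range of normal single-byte code sets. The pound symbol £ is U+00A3 POUND SIGN, so it is in the range of Latin Alphabet 1 (ISO 8859-1, 8859-15). The trademark symbol ™ is U+2122 TRADE MARK SIGN, also outside the range of normal single-byte code sets.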

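This shows that, at least on some platforms, readdir() works perfectly well with UTF-8 encoded file names that use Unicode characters outside the Latin-1 character set. It also shows how I would go about debugging the problem, and illustrates what I would like you to run (the program above) and the sort of directory you should run it on, to make your case that readdir() on your platform does not like Unicode file names.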
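When a name appears to be skipped, it can also help to look at the raw bytes readdir() returns, since the same visible name can be stored as different byte sequences (for example precomposed versus decomposed characters). The following is a minimal sketch of such a dump; the helper names hex_name and dump_names are hypothetical, and the buffer size assumes names shorter than 256 bytes.

```c
#include <dirent.h>
#include <stdio.h>

/* Hypothetical helper: writes the bytes of name into out as two-digit
 * hex values separated by spaces, so non-ASCII names can be inspected
 * regardless of the terminal's encoding.
 * Returns the number of bytes of name that were dumped. */
static size_t hex_name(const char *name, char *out, size_t outsize)
{
    size_t n = 0;
    size_t used = 0;

    if (outsize > 0)
        out[0] = '\0';
    /* Each step needs at most 3 characters plus the terminating NUL. */
    for (; name[n] != '\0' && used + 4 <= outsize; n++)
        used += (size_t)snprintf(out + used, outsize - used, "%s%02x",
                                 n ? " " : "",
                                 (unsigned)(unsigned char)name[n]);
    return n;
}

/* Prints every entry of path as "<hex bytes>  <name>".
 * Returns -1 if the directory cannot be opened, otherwise the number
 * of entries seen (including . and .. when the system reports them). */
static int dump_names(const char *path)
{
    DIR *dirp = opendir(path);
    struct dirent *entry;
    int count = 0;
    char hex[4 * 256];  /* assumption: names are shorter than 256 bytes */

    if (dirp == NULL)
        return -1;
    while ((entry = readdir(dirp)) != NULL)
    {
        hex_name(entry->d_name, hex, sizeof hex);
        printf("%s  %s\n", hex, entry->d_name);
        count++;
    }
    closedir(dirp);
    return count;
}
```

For the € file above, a UTF-8 encoded name would show up as the three bytes e2 82 ac. If dump_names("utf8") lists a file that the counting loop misses, the name is being read correctly and the problem lies in the d_type test instead.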

score 2 · Accepted Answer

Try changing

if(entry->d_type==DT_REG)

to

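if((entry->d_type==DT_REG || entry->d_type==DT_UNKNOWN)
    && strcmp(entry->d_name,".")!=0 && strcmp(entry->d_name,"..")!=0)

This should let you count those files by also counting entries whose type is reported as unknown.

Note that strcmp(entry->d_name,".")!=0 and strcmp(entry->d_name,"..")!=0 are used to exclude the . and .. directory entries; strcmp() requires #include <string.h>.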
