matlab - How to import CSV data into Matlab using only low-level I/O commands

Question

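I'm really struggling to figure out how to import CSV data containing 9 columns and about 400 rows into a table in the Matlab workspace. This would be easy if I were allowed to use the built-in toolboxes Matlab has to offer, but I need to try to complete the task using only commands such as fscanf, fopen, etc. The data are in mixed formats from column to column: decimals, floats, strings, etc. I am also permitted to use csvread, but I have not managed to get it working, since from what I understand csvread only works with numeric values.

Here is my code: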

>> filename = 'datatext.csv'; %The file in CSV format
>> fid = fopen(filename); 
>> headers = fgetl(fid); %using fgetl to remove the headers and put them in a variable

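I have used fgetl to skip the header line of the file and store it in its own variable, but I am not sure where to go from here when it comes to creating the table. Basically, I want to end up with a 400-row by 9-column table in Matlab's workspace.

Here is a sample of a few lines of the text file: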

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2

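I assume I will have to use built-in conversion functions for some of the cells, which I can do. I may have left out important information needed to get the right help, but any help anyone can offer would be greatly appreciated. Thank you.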

score 3 · Accepted Answer

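The lowest-level functions for reading a file are (as described in Import Text Data Files with Low-Level I/O):

fscanf, which reads formatted data in a text or ASCII file; that is, a file you can view in a text editor. For details, see Reading Data in a Formatted Pattern.
fgetl and fgets, which read one line of a file at a time, where a newline character separates each line. For details, see Reading Data Line-by-Line.
fread, which reads a stream of data at the byte or bit level. For details, see Import Binary Data with Low-Level I/O.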

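In your case the input file is ASCII, not binary, so we can immediately rule out the last option (fread).

That leaves fgetl/fgets (used to read the file line by line, then parse each line) and fscanf.

You already have two answers that use the line-by-line approach, so I won't go into detail on that one. Instead I'll show you how to use fscanf (your data are suitable for it, since they are indeed organised in a formatted pattern).

The benefit of using fscanf is that, provided you use the right formatSpec parameter, the function can read the whole file in one go instead of iterating line by line. That holds for all the numeric data in your file. We will need a second pass for the text elements in the last column.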
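To make the difference between the two line readers concrete, here is a minimal sketch. It writes a small temporary file (the file name and contents are made up for illustration) and reads it back: fgetl drops the newline character, fgets keeps it, and fgetl returns the number -1 once no lines remain.

```matlab
% Write a tiny two-line file to a temporary location (hypothetical data)
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '18,8,318\n16,8,304\n');
fclose(fid);

fid = fopen(tmpFile, 'r');
withoutNewline = fgetl(fid);   % first line, newline character removed
withNewline    = fgets(fid);   % second line, newline character kept
atEnd          = fgetl(fid);   % numeric -1 because no lines remain
fclose(fid);
delete(tmpFile);
```

Either function leaves the parsing of each line to you, which is why the line-by-line route needs an extra splitting step.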

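Define the format specifiers:

First let's define your format specifiers. We will use a different format for each pass. The first pass reads all the numeric data but skips the text field, while the second pass does the opposite: it ignores all the numeric data and reads only the text field. The '*' character is very useful when defining format specifiers, because it tells fscanf to match a field but leave it out of the output: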

DataFormatSpec = repmat('%d,',1,8) ;
DataFormatSpec = [DataFormatSpec '%*s'] ; % yield:  '%d,%d,%d,%d,%d,%d,%d,%d,%*s'

TextFormatSpec = repmat('%*d,',1,8) ;
TextFormatSpec = [TextFormatSpec '%s'] ; % yield:  '%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%s'

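I have used %d for all the columns because in your sample data I did not see the variety of numeric types you mention. You can easily substitute %f where the data require it, and you can mix both types in the same specifier without problems (as long as they are all numeric). Just use whatever makes sense for each column.

With these in place, let's move on to the file. The data follow a formatted pattern after the header line, so we first need to get past the header line before calling fscanf. We'll do that the same way you did: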
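As a small illustration of mixing %f and %d in one specifier, the sketch below parses a single made-up line with sscanf, which accepts the same formatSpec syntax as fscanf but reads from a character vector. The sample line and the variable names are assumptions for illustration, not taken from your file.

```matlab
% One hypothetical data line whose first column is a float
sampleLine = '18.5,8,318';

% %f for the first column, %d for the two integer columns
mixedSpec = '%f,%d,%d';
vals = sscanf(sampleLine, mixedSpec);   % 3-by-1 double column vector
```

All values come back in a single double array, so the choice between %d and %f only affects how each field is parsed, not the class of the output.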

%% Open file and retrieve header line
filein = 'datatext.csv' ;       % the CSV file from the question
fid = fopen( filein,'r') ;
hdr = fgetl(fid) ;              % Retrieve the header line
DataStartIndex = ftell(fid) ;   % save the starting index of data section

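The call to ftell lets us save the position of the file pointer. After we read the header, the pointer is at the start of the data section. We save it so that we can rewind the file pointer to the same point for the second read pass.

Read the numeric data:

This is done very quickly with a single call to fscanf: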
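The save-and-rewind idea can be tried in isolation. The following is a minimal sketch on a throwaway file; the header, the data rows and the variable names are all hypothetical.

```matlab
% Create a small CSV with one header line (hypothetical contents)
tmpFile = [tempname '.csv'];
fid = fopen(tmpFile, 'w');
fprintf(fid, 'a,b,label\n1,2,foo\n3,4,bar\n');
fclose(fid);

fid = fopen(tmpFile, 'r');
hdr = fgetl(fid);                  % consume the header line
dataStart = ftell(fid);            % byte offset of the first data row
firstPass = fgetl(fid);            % read the first data row
fseek(fid, dataStart, 'bof');      % jump back to the start of the data
secondPass = fgetl(fid);           % the same row is read again
fclose(fid);
delete(tmpFile);
```

Because the offset is measured from the beginning of the file ('bof'), the rewind works no matter how far the first pass has advanced.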

%% First pass, read the "numeric values"
dataArray = fscanf( fid , DataFormatSpec , [8 Inf]).' ;

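Note the transpose operator .' at the end of the line. It is needed because fscanf fills the output array in column-major order, whereas the data in the text file are laid out in row-major order. The final transpose simply gives the output array the same dimensions as in the text file.

Now dataArray contains all your numeric data: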

>> dataArray
dataArray =
          18           8         318         150        3436          11          70           1
          16           8         304         150        3433          12          70           1
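The column-major filling described above is easy to see on a tiny input. This sketch uses sscanf on an in-memory string (made-up values, three columns instead of eight) so it runs without a file; the size argument works the same way as in fscanf.

```matlab
% Two rows of three values, as they would appear in a text file
txt = sprintf('1,2,3\n4,5,6\n');

% Each text row fills one COLUMN of the 3-by-N output
raw = sscanf(txt, '%d,%d,%d', [3 Inf]);   % [1 4; 2 5; 3 6]

% Transposing restores the row/column layout of the text
fixed = raw.';                            % [1 2 3; 4 5 6]
```

The first dimension passed in the size argument is therefore the number of columns in the file, not the number of rows.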

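Read the text data:

This is where it gets slightly more complicated. fscanf automatically converts text characters to their ASCII values. Converting those back to actual characters is easy (using the function char()). The bigger hurdle is that if we read all the text fields in one go, they will all appear as one continuous sequence of numbers, with no way of knowing where each string stops and the next one begins. To overcome that, we will read line by line, but still using fscanf: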

%% Second pass, read the "text" values
fseek(fid,DataStartIndex,'bof') ;   % Rewind file pointer to the start of data section
nRows = size(dataArray,1) ;         % How many lines we'll need to read
textArray = cell(nRows,1) ;         % pre allocate a cell array to receive the text column elements

for iline=1:nRows
    textArray{iline,1} = char( fscanf(fid,TextFormatSpec,1).' ) ;
end

fclose(fid) ;   % Close file

Note again the use of the transpose operator .' and of char(). Now textArray is a cell array containing all your text fields:

>> textArray
textArray = 
    'sampletext'
    'sampletext2'

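Regroup the data set:

Personally, I would leave the two arrays separate, as they are the most optimised containers for each type of data (a double array for the numeric data and a cell array for the strings). However, if you need to regroup them into a single data structure, you can use a cell array: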

%% Optional, merge data into cell array
FullArray = [num2cell(dataArray) textArray]
FullArray = 
    [18]    [8]    [318]    [150]    [3436]    [11]    [70]    [1]    'sampletext' 
    [16]    [8]    [304]    [150]    [3433]    [12]    [70]    [1]    'sampletext2'

Or you could use a table:

%% Optional, merge data into a table
T = array2table(dataArray) ;
T.text = textArray ;
T.Properties.VariableNames = [cellstr(reshape(sprintf('v%d',1:8),2,[]).') ; {'text'}] ;

This gives:

T = 
    v1    v2    v3     v4      v5     v6    v7    v8        text     
    __    __    ___    ___    ____    __    __    __    _____________
    18    8     318    150    3436    11    70    1     'sampletext' 
    16    8     304    150    3433    12    70    1     'sampletext2'

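Obviously, if you choose the table version, use the variable names parsed from your header instead of the automatically generated ones I used in this example.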
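One possible way to turn the header line into variable names is sketched below. The header text is hypothetical (your real one comes from fgetl), and strsplit and matlab.lang.makeValidName are ordinary string utilities rather than low-level I/O, so check that they are allowed for your task.

```matlab
% Hypothetical header line, standing in for the hdr returned by fgetl
hdr = 'mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name';

% Split on commas and make sure every name is a valid identifier
varNames = matlab.lang.makeValidName(strsplit(hdr, ','));

% Same sample values as in the answer
dataArray = [18 8 318 150 3436 11 70 1 ; 16 8 304 150 3433 12 70 1];
textArray = {'sampletext' ; 'sampletext2'};

% Build the table with the parsed names instead of v1..v8 and 'text'
T = array2table(dataArray, 'VariableNames', varNames(1:8));
T.(varNames{9}) = textArray;
```

Columns can then be addressed by name, for example T.weight or T.name.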

score 2 · Accepted Answer

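Using `fgetl()` repeatedly to retrieve the lines of the text file and `split()` to separate the line contents

Not sure whether you are allowed to use the split() function once the text file contents have been read, but here is an implementation that uses fgetl() in a loop to grab the contents of the text file line by line. After all the lines are retrieved, split() with its second argument set to the comma delimiter , splits the contents into cells. The first loop (a while loop) retrieves the contents of the text file line by line and stores them in a string array named Lines. The second loop (a for loop) splits the strings stored in Lines at the delimiter , and stores the resulting pieces in another string array, as shown below, with the contents separated. Here -1 indicates that an erroneous entry was retrieved or that the end of the file was reached.

Sample.txt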

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2

Script:

Text = "Start";

% open file (read only)
fileID = fopen('Sample.txt', 'r');
% Run the while loop until fgetl returns -1 (end of file)
Line_Index = 1;
while(Text ~= "-1")
    % read line/row
    Text = string(fgetl(fileID));
    % stopping criterion
    if (Text ~= "-1")
        Lines(Line_Index,1) = Text;
    end
    % update row index
    Line_Index = Line_Index + 1;
end
% close file
fclose(fileID);

[Number_Of_Lines,~] = size(Lines);
Output_Array = strings(Number_Of_Lines,9);


for Row_Index = 1: Number_Of_Lines
    Line = split(Lines(Row_Index,:),',');
    Line = Line';
    Output_Array(Row_Index,:) = string(Line);
end

Ran using MATLAB R2019b
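A variation on the loop above, offered only as a sketch: fgetl returns a character vector for every real line and the number -1 at the end of the file, so testing ischar() on the raw result avoids stopping early if a line happens to contain the literal text -1. The temporary file and its contents are made up for illustration.

```matlab
% Temporary file whose second line is the literal text "-1" (hypothetical)
tmpFile = [tempname '.txt'];
fileID = fopen(tmpFile, 'w');
fprintf(fileID, '18,8,318\n-1\n16,8,304\n');
fclose(fileID);

fileID = fopen(tmpFile, 'r');
Lines = strings(0,1);
tline = fgetl(fileID);
while ischar(tline)              % false only when fgetl returns numeric -1
    Lines(end+1,1) = tline;      %#ok<AGROW>
    tline = fgetl(fileID);
end
fclose(fileID);
delete(tmpFile);
```

Here all three lines end up in Lines, including the one that reads -1.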

score 2 · Accepted Answer

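Although @MichaelTr7's answer is perfectly fine, I'd like to suggest a more elaborate answer that includes conversion to types and finally returns a table. Note that it also includes preallocation of variables. Because MATLAB stores variables in contiguous blocks of RAM, it is best to tell it beforehand how large your variable will be. (MATLAB actually complains about variables that appear to grow inside a loop...)

The solution also builds on fgetl and (later) split + cell2table (which is definitely no longer a low-level function, but that may be fine in your case since it no longer deals with reading).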

% USER-INPUT
FileName = 'Sample.csv';
strType = "%l,%f,%d,%f,%f,%f,%f,%f,%s";
delimiter = ",";



% allocate strings
Data = strings(100,1);

% open file (read only)
fileID = fopen(FileName, 'r');
% Run the while loop until fgetl returns -1 (end of file)
Line_idx = 1;
while true
    % read line/row
    Line_text = string(fgetl(fileID));
    % stopping criterion
    if (Line_text == "-1")
        break
    end
    
    Data(Line_idx,1) = Line_text;
    % update row index
    Line_idx = Line_idx + 1;
    
    % extend allocation
    if Line_idx > size(Data,1)
        Data = [Data;strings(100,1)]; %#ok<AGROW>
    end
end
% close file
fclose(fileID);
% crop variable/rows
Data = Data(1:Line_idx-1,:);



strType_splt = split(strType,    delimiter);
Num_strType = length( strType_splt );
Num_Data    = length( split(Data(1,:),  delimiter) );

% check number of conversion types
assert( Num_strType == Num_Data, strcat("Conversion format 'strType' has not the same number of values as the file ",FileName,"."))



% allocate cell
C = cell(size(Data,1),Num_strType);

% loop over rows & columns + convert the elements
for r = 1:size(Data,1) % loop over rows
    line = Data(r);
    % split into individual strings
    line_splt = split(line,  delimiter);
    
    for c = 1:Num_strType % loop over columns
        element_str = line_splt(c);
        type = strType_splt(c);
        C{r,c} = convertStr( type, element_str );
    end
end
% create table
T = cell2table(C);




function element = convertStr(type,str)

    switch type
        case "%f" % float = single | convert to double and cast to single
            element = single(str2double(str));
        case "%l" % long float
            element = str2double(str);
        case "%d" % convert to double and cast to integer
            element = int32(str2double(str));
        case "%s"
            element = string(str);
        case "%D" % non-standard: datetime
            element = datetime(str);
        otherwise
            element = {str};
    end
end

This assumes a file Sample.csv, for example with the following content:

18,8,318,150,3436,11,70,1,sampletext
16,8,304,150,3433,12,70,1,sampletext2
