python - Read data from a CSV file and convert from strings to the correct data types, including a list-of-integers column

Question

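When I read data back in from a CSV file, every cell is interpreted as a string.

How can I automatically convert the data I read in into the correct type?
Or better: how can I tell the csv reader the correct data type of each column?

(I wrote a 2-dimensional list, where each column is of a different type (bool, str, int, list of integers), out to a CSV file.)

Sample data (in the CSV file):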

IsActive,Type,Price,States
True,Cellphone,34,"[1, 2]"
,FlatTv,3.5,[2]
False,Screen,100.23,"[5, 1]"
True,Notebook, 50,[1]

score 16 · Accepted Answer

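As the docs explain, the CSV reader doesn't perform automatic data conversion. There is the QUOTE_NONNUMERIC format option, but that only converts all unquoted fields to floats. This is very similar to how other csv readers behave.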
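To illustrate the QUOTE_NONNUMERIC behaviour mentioned above, here is a minimal sketch. The inline sample line is made up for illustration and is not the question's file.

```python
import csv
import io

# Hypothetical sample: one quoted field, two unquoted numeric fields.
sample = io.StringIO('"Cellphone",34,3.5\n')

# QUOTE_NONNUMERIC makes the reader convert every unquoted field to float.
reader = csv.reader(sample, quoting=csv.QUOTE_NONNUMERIC)
row = next(reader)
print(row)  # ['Cellphone', 34.0, 3.5]
```

Note that an unquoted non-numeric field such as True raises ValueError under this option, so it would not fit the question's sample data as written.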

I don't believe Python's csv module is of any help in this case. As others have already pointed out, literal_eval() is a far better choice.

The following does work and converts:

strings
ints
floats
lists
dictionaries

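You may also use it for booleans and NoneType, although these have to be formatted accordingly for literal_eval() to accept them. LibreOffice Calc displays booleans in all capitals (TRUE/FALSE), whereas in Python they are capitalized (True/False). Also, you would have to replace empty strings with None (without quotes).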
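The formatting caveats above can be sketched as a small helper. This is only an illustration: to_python is a hypothetical name, and falling back to the raw string for cells that are not valid literals is my own assumption, not part of the importer shown in this answer.

```python
import ast

def to_python(text):
    """Convert one CSV cell to a Python value (hypothetical helper)."""
    text = text.strip()
    if text == "":
        return None                   # empty cell -> None
    if text.upper() in ("TRUE", "FALSE"):
        text = text.capitalize()      # 'TRUE' -> 'True' so literal_eval accepts it
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text                   # not a Python literal: keep the raw string

cells = ["TRUE", "Cellphone", " 50", "3.5", "[1, 2]", "{'a': 1}", ""]
converted = [to_python(c) for c in cells]
print(converted)
# [True, 'Cellphone', 50, 3.5, [1, 2], {'a': 1}, None]
```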

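I'm writing an importer for mongodb that does all of this. The following is part of the code I've written so far.

[NOTE: My csv uses tab as the field delimiter. You may want to add some exception handling too]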

import ast
from itertools import islice

def getFieldnames(csvFile):
    """
    Read the first row and store values in a tuple
    """
    with open(csvFile) as csvfile:
        firstRow = csvfile.readlines(1)
        fieldnames = tuple(firstRow[0].strip('\n').split("\t"))
    return fieldnames

def writeCursor(csvFile, fieldnames):
    """
    Convert csv rows into an array of dictionaries
    All data types are automatically checked and converted
    """
    cursor = []  # Placeholder for the dictionaries/documents
    with open(csvFile) as csvFile:
        for row in islice(csvFile, 1, None):
            values = list(row.strip('\n').split("\t"))
            for i, value in enumerate(values):
                nValue = ast.literal_eval(value)
                values[i] = nValue
            cursor.append(dict(zip(fieldnames, values)))
    return cursor
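As a sketch of the exception handling suggested in the note, the same idea can be written with csv.DictReader and a fallback for cells that are not valid Python literals. The function name read_typed_rows and the inline tab-separated sample are assumptions for illustration, and the sketch assumes every row has the same number of fields as the header.

```python
import ast
import csv
import io

def read_typed_rows(lines, delimiter="\t"):
    """Return a list of dicts, one per data row, with converted values."""
    documents = []
    for row in csv.DictReader(lines, delimiter=delimiter):
        doc = {}
        for name, value in row.items():
            try:
                doc[name] = ast.literal_eval(value.strip())
            except (ValueError, SyntaxError):
                doc[name] = value     # leave unparseable cells as strings
        documents.append(doc)
    return documents

# Hypothetical tab-separated sample with a header row.
sample = io.StringIO("IsActive\tType\tPrice\tStates\nTrue\tCellphone\t34\t[1, 2]\n")
documents = read_typed_rows(sample)
print(documents)
# [{'IsActive': True, 'Type': 'Cellphone', 'Price': 34, 'States': [1, 2]}]
```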

score 8 · Accepted Answer

You have to map your rows:

import csv
import io

data = """True,foo,1,2.3,baz
False,bar,7,9.8,qux"""

reader = csv.reader(io.StringIO(data), delimiter=",")
# Convert each column explicitly: bool, str, int, float, str
parsed = (({'True': True}.get(row[0], False),
           row[1],
           int(row[2]),
           float(row[3]),
           row[4])
          for row in reader)
for row in parsed:
    print(row)

The result is

(True, 'foo', 1, 2.3, 'baz')
(False, 'bar', 7, 9.8, 'qux')
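The same row-mapping idea can be generalized by keeping one converter per column in a list and zipping it with each row. This is a sketch under my own assumptions: the converters list and the inline data (taken from the question's sample) are illustrative, and the first converter reuses the "only 'True' counts as true" rule from the code above.

```python
import ast
import csv
import io

# One converter per column, in column order (hypothetical layout).
converters = [
    lambda s: s == "True",   # bool column: only the text 'True' is truthy
    str,                     # text column
    float,                   # numeric column
    ast.literal_eval,        # list-of-integers column
]

data = io.StringIO('True,Cellphone,34,"[1, 2]"\nFalse,Screen,100.23,"[5, 1]"\n')
rows = [tuple(convert(cell) for convert, cell in zip(converters, row))
        for row in csv.reader(data)]
for row in rows:
    print(row)
# (True, 'Cellphone', 34.0, [1, 2])
# (False, 'Screen', 100.23, [5, 1])
```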

score 7 · Accepted Answer

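I know this is a fairly old question, tagged python-2.5, but here's an answer that works with Python 3.6+, which might be of interest to folks using more up-to-date versions of the language.

It leverages the built-in typing.NamedTuple class, which was added in Python 3.5. What may not be evident from the documentation is that the "type" of each field can be a function.

The example usage code also uses so-called f-string literals, which weren't added until Python 3.6, but they aren't required for the core data-type transformations.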

#!/usr/bin/env python3.6
import ast
import csv
from typing import NamedTuple


class Record(NamedTuple):
    """ Define the fields and their types in a record. """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.

    @classmethod
    def _transform(cls: 'Record', dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding Record
            field type.
        """
        return {name: cls.__annotations__[name](value)
                    for name, value in dict_.items()}


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = Record._transform(row)
        print(f'row {i}: {row}')

Output:

row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
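One thing to watch in the output above: bool() returns True for any non-empty string, so the text 'False' in the Screen row comes out as True, and only the empty cell comes out as False. A stricter converter could be sketched as follows. The name parse_bool and its handling of empty cells are my assumptions; it should be usable in place of bool in the field annotations, since any callable works there.

```python
def parse_bool(text):
    """Hypothetical converter: map 'True'/'False' text to bool, empty to None."""
    cleaned = text.strip().lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    if cleaned == "":
        return None
    raise ValueError(f"not a boolean: {text!r}")

print(bool("False"))        # True, because the string is non-empty
print(parse_bool("False"))  # False
print(parse_bool(""))       # None
```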

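Generalizing the NamedTuple approach by creating a base class that contains just the generic classmethod is not simple, because of the way typing.NamedTuple is implemented.

To avoid that issue, in Python 3.7+ a dataclasses.dataclass can be used instead, because dataclasses don't have the inheritance issue, so creating a generic base class that can be reused is simple: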

#!/usr/bin/env python3.7
import ast
import csv
from dataclasses import dataclass, fields
from typing import Type, TypeVar

T = TypeVar('T', bound='GenericRecord')

class GenericRecord:
    """ Generic base class for transforming dataclasses. """
    @classmethod
    def _transform(cls: Type[T], dict_: dict) -> dict:
        """ Convert string values in given dictionary to corresponding type. """
        return {field.name: field.type(dict_[field.name])
                    for field in fields(cls)}


@dataclass
class CSV_Record(GenericRecord):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval  # Handles string representation of literals.


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = CSV_Record._transform(row)
        print(f'row {i}: {row}')

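In a sense it's not really very important which one you use, because an instance of the class is never created. Using one is just a clean way of specifying and holding a definition of the field names and their types in a record data structure.

A TypedDict was added to the typing module in Python 3.8. It can also be used to provide the type information, but must be used in a slightly different manner, since it doesn't actually define a new type like NamedTuple and dataclasses do, so it requires a standalone transforming function: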

#!/usr/bin/env python3.8
import ast
import csv
from typing import TypedDict


def transform(dict_, typed_dict) -> dict:
    """ Convert values in given dictionary to corresponding types in TypedDict . """
    fields = typed_dict.__annotations__
    return {name: fields[name](value) for name, value in dict_.items()}


class CSV_Record_Types(TypedDict):
    """ Define the fields and their types in a record.
        Field names must match column names in CSV file header.
    """
    IsActive: bool
    Type: str
    Price: float
    States: ast.literal_eval


filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file), 1):
        row = transform(row, CSV_Record_Types)
        print(f'row {i}: {row}')

score 2 · Accepted Answer

Props to Jon Clements and cortopy for teaching me about ast.literal_eval! Here's what I ended up going with (Python 2; changes for 3 should be trivial):

from ast import literal_eval
from csv import DictReader
import csv


def csv_data(filepath, **col_conversions):
    """Yield rows from the CSV file as dicts, with column headers as the keys.

    Values in the CSV rows are converted to Python values when possible,
    and are kept as strings otherwise.

    Specific conversion functions for columns may be specified via
    `col_conversions`: if a column's header is a key in this dict, its
    value will be applied as a function to the CSV data. Specify
    `ColumnHeader=str` if all values in the column should be interpreted
    as unquoted strings, but might be valid Python literals (`True`,
    `None`, `1`, etc.).

    Example usage:

    >>> csv_data(filepath,
    ...          VariousWordsIncludingTrueAndFalse=str,
    ...          NumbersOfVaryingPrecision=float,
    ...          FloatsThatShouldBeRounded=round,
    ...          **{'Column Header With Spaces': arbitrary_function})
    """

    def parse_value(key, value):
        if key in col_conversions:
            return col_conversions[key](value)
        try:
            # Interpret the string as a Python literal
            return literal_eval(value)
        except Exception:
            # If that doesn't work, assume it's an unquoted string
            return value

    with open(filepath) as f:
        # QUOTE_NONE: don't process quote characters, to avoid the value
        # `"2"` becoming the int `2`, rather than the string `'2'`.
        for row in DictReader(f, quoting=csv.QUOTE_NONE):
            yield {k: parse_value(k, v) for k, v in row.iteritems()}

(I'm a little worried that I may have missed some corner case involving quoting. Please comment if you spot any issues!)
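One quoting corner case that seems relevant to the question's sample data: with QUOTE_NONE the reader no longer treats quotes specially, so a quoted field that contains a comma, such as "[1, 2]", is split at that comma. A minimal sketch (written for Python 3, using one line taken from the question's sample):

```python
import csv

line = ['True,Cellphone,34,"[1, 2]"']

default_row = next(csv.reader(line))
no_quote_row = next(csv.reader(line, quoting=csv.QUOTE_NONE))

print(default_row)   # ['True', 'Cellphone', '34', '[1, 2]']
print(no_quote_row)  # ['True', 'Cellphone', '34', '"[1', ' 2]"']
```

So QUOTE_NONE only fits files where no field contains the delimiter.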

score 1 · Accepted Answer

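I too really liked @martineau's approach, and was especially intrigued by his comment that the essence of his code is a clean mapping between fields and types. That suggested to me that a dictionary would also work. Hence the variation on his theme shown below. It has worked well for me.

Obviously the value in the dictionary is really just a callable, and so it can be used to provide a hook for data massaging as well as type conversion, if one so chooses.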

import ast
import csv

fix_type = {'IsActive': bool, 'Type': str, 'Price': float, 'States': ast.literal_eval}

filename = 'test_transform.csv'

with open(filename, newline='') as file:
    for i, row in enumerate(csv.DictReader(file)):
        row = {k: fix_type[k](v) for k, v in row.items()}
        print(f'row {i}: {row}')

Output

row 0: {'IsActive': True, 'Type': 'Cellphone', 'Price': 34.0, 'States': [1, 2]}
row 1: {'IsActive': False, 'Type': 'FlatTv', 'Price': 3.5, 'States': [2]}
row 2: {'IsActive': True, 'Type': 'Screen', 'Price': 100.23, 'States': [5, 1]}
row 3: {'IsActive': True, 'Type': 'Notebook', 'Price': 50.0, 'States': [1]}
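To illustrate the point about the dictionary values being arbitrary callables, here is a sketch where some entries do data massaging as well as conversion. The specific lambdas, the inline sample row, and the use of .get(k, str) so that unlisted columns stay strings are my own assumptions, not part of the code above.

```python
import ast
import csv
import io

fix_type = {
    'IsActive': lambda s: s.strip() == 'True',  # stricter than bool()
    'Type': str.upper,                          # massage: upper-case the text
    'Price': lambda s: round(float(s)),         # convert, then round
    'States': ast.literal_eval,
}

sample = io.StringIO('IsActive,Type,Price,States\nFalse,Screen,100.23,"[5, 1]"\n')
# Columns missing from fix_type fall back to str, i.e. they stay unchanged.
rows = [{k: fix_type.get(k, str)(v) for k, v in row.items()}
        for row in csv.DictReader(sample)]
print(rows)
# [{'IsActive': False, 'Type': 'SCREEN', 'Price': 100, 'States': [5, 1]}]
```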

score 0 · Accepted Answer

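An alternative to using ast.literal_eval (although it seems a bit extreme) is the pyparsing module available on PyPI. See whether the code sample at http://pyparsing.wikispaces.com/file/view/parsePythonValue.py suits what you need, or can easily be adapted.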

score 0 · Accepted Answer

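Here's my take on the question, in case you have to handle multiple csv formats, perform some extra custom data wrangling on certain columns, and output a list of lists or a tuple of tuples.

The types are given as strings because the column types are stored in a database, outside the Python code. This also makes it possible to add custom types if needed.

I haven't tested this against very large files, though, because in my production code I use pandas, and this code is used for some test setup. I'd guess it consumes more memory than some of the other answers, since all of the data from the csv is loaded at once.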

import ast
import csv

dict_headers_type = {
    "IsActive": "bool",
    "Type": "str",
    "Price": "float",
    "State": "list",
}

dict_converters = {
    "bool": lambda x: bool(x),
    "float": lambda x: float(x),
    "list": lambda x: ast.literal_eval(x),
}

dict_header_converter = {
    header: dict_converters[my_type]
    for header, my_type in dict_headers_type.items()
    if my_type in dict_converters.keys()
}

With this in place, we can perform the conversion:

with open(csv_path) as f:
    data = [line for line in csv.reader(f)]

# list of the converters to apply
ls_f = [
    dict_header_converter[header]
    if header in dict_header_converter.keys() else None
    for header in data[0]
]


ls_records = [
    [f(datapoint) if f else datapoint
     for f, datapoint in zip(ls_f, row)]
    for row in data[1:]
]

# to add headers, if needed:
ls_records.insert(0, data[0])

Output:

[
  ['IsActive', 'Type', 'Price', 'State'],
  [True, 'Cellphone', 34.0, [1, 2]],
  [False, 'FlatTv', 3.5, [2]],
  [True, 'Screen', 100.23, [5, 1]],
  [True, 'Notebook', 50.0, [1]],
]
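Regarding the memory concern mentioned above, one possible variation is to convert rows lazily with a generator instead of loading the whole file into a list first. This is only a sketch: iter_records, the header_converter contents, and the inline sample are assumptions for illustration.

```python
import ast
import csv
import io

def iter_records(lines, header_converter):
    """Yield converted rows one at a time instead of building a full list."""
    reader = csv.reader(lines)
    headers = next(reader)
    # None means "no converter for this column": keep the raw string.
    funcs = [header_converter.get(h) for h in headers]
    for row in reader:
        yield [f(cell) if f else cell for f, cell in zip(funcs, row)]

header_converter = {"Price": float, "State": ast.literal_eval}
sample = io.StringIO('IsActive,Type,Price,State\nTrue,Cellphone,34,"[1, 2]"\n')
records = list(iter_records(sample, header_converter))
print(records)
# [['True', 'Cellphone', 34.0, [1, 2]]]
```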

score 0 · Accepted Answer

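I love @martineau's answer. It's very clean.

One thing I needed was to convert only a couple of values and leave all the other fields as strings, i.e. have strings as the default and only update the type for specific keys.

To do this, just replace this line: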

row = CSV_Record._transform(row)

with this one:

row.update(CSV_Record._transform(row))

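The "update" function updates the variable row in place, merging the raw data extracted from the csv with the values converted to the correct type by the "_transform" method.

Note that there is no "row =" in the updated version.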
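A self-contained sketch of that partial conversion, assuming a dataclass that lists only the columns to convert. PartialRecord and the inline sample are hypothetical names and data; the _transform method follows the dataclass-based approach from @martineau's answer.

```python
import csv
import io
from dataclasses import dataclass, fields

@dataclass
class PartialRecord:
    Price: float   # only this column is converted; the others stay strings

    @classmethod
    def _transform(cls, dict_):
        # Convert only the fields declared on the dataclass.
        return {f.name: f.type(dict_[f.name]) for f in fields(cls)}

sample = io.StringIO("IsActive,Type,Price\nTrue,Cellphone,34\n")
rows = []
for row in csv.DictReader(sample):
    row.update(PartialRecord._transform(row))  # merge converted values in place
    rows.append(dict(row))
print(rows)
# [{'IsActive': 'True', 'Type': 'Cellphone', 'Price': 34.0}]
```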

Hope this helps in case anyone has a similar requirement.

(PS: I'm quite new to posting on stackoverflow, so please let me know if the above is unclear)
