python - Load CSV file with Spark

Question

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing:

sc.textFile('file.csv')
    .map(lambda line: (line.split(',')[0], line.split(',')[1]))
    .collect()

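I would expect this call to give me a list of the first two columns of my file, but I'm getting this error:

File "", line 1, in IndexError: list index out of range

although my CSV file has more than one column.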

score 203 · Accepted Answer

Spark 2.0.0+

You can use the built-in csv data source directly:

spark.read.csv(
    "some_input_file.csv", 
    header=True, 
    mode="DROPMALFORMED", 
    schema=schema
)

or

(
    spark.read
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .csv("some_input_file.csv")
)

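without including any external dependencies.

Spark < 2.0.0:

Instead of manual parsing, which is far from trivial in the general case, I would recommend spark-csv:

Make sure that Spark CSV is included in the path (--packages, --jars, --driver-class-path)

And load your data as follows: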

df = (
    sqlContext
    .read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferschema", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv")
)

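It handles loading, schema inference, and dropping malformed lines, and doesn't require passing data from Python to the JVM.

Note:

If you know the schema, it is better to avoid schema inference and pass the schema to DataFrameReader. Assuming you have three columns - integer, double and string: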

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", DoubleType()),
    StructField("C", StringType())
])

(
    sqlContext
    .read
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("header", "true")
    .option("mode", "DROPMALFORMED")
    .load("some_input_file.csv")
)
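
To make the idea of "explicit schema plus dropping malformed rows" concrete, here is a rough plain-Python analogue. It is only a sketch of the concept, not how Spark implements DROPMALFORMED, and the exact rules Spark applies can differ by version. The names SCHEMA, parse_rows and sample are made up for illustration.

```python
import csv
import io

# Hypothetical three-column schema mirroring A (integer), B (double), C (string).
SCHEMA = [("A", int), ("B", float), ("C", str)]


def parse_rows(text):
    """Yield one dict per row that fits SCHEMA; silently skip rows that do not."""
    reader = csv.reader(io.StringIO(text, newline=""))
    next(reader, None)  # skip the header row
    for fields in reader:
        if len(fields) != len(SCHEMA):
            continue  # wrong number of columns
        try:
            yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}
        except ValueError:
            continue  # a value could not be converted to its declared type


# Made-up sample: row 2 has a non-integer in A, row 3 is missing a column.
sample = "A,B,C\n1,2.5,x\noops,3.0,y\n4,5.0\n7,8.0,z\n"
rows = list(parse_rows(sample))
print(rows)  # [{'A': 1, 'B': 2.5, 'C': 'x'}, {'A': 7, 'B': 8.0, 'C': 'z'}]
```

Because the types are declared up front, nothing has to scan the data first to guess them, which is the main reason to pass a schema when you already know it.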

score 67 · Accepted Answer

Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()

Alternatively, you could print the culprit (if any):

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)<=1) \
    .collect()
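
The same check can be tried without a cluster. The sketch below uses a made-up list of lines (sample_lines) in place of the file and shows that an empty line or a line without a comma splits into a single element, so indexing position 1 raises the IndexError from the question.

```python
# Made-up stand-in for the lines of file.csv.
sample_lines = ["a,b,c", "", "single", "d,e"]


def first_two(line):
    parts = line.split(",")
    return parts[0], parts[1]  # raises IndexError when the line has no comma


# Collect (index, line) for every line that would break first_two.
bad = [(i, line) for i, line in enumerate(sample_lines) if len(line.split(",")) < 2]
print(bad)  # [(1, ''), (2, 'single')]

# Only the well-formed lines are safe to index.
good = [first_two(line) for line in sample_lines if len(line.split(",")) > 1]
print(good)  # [('a', 'b'), ('d', 'e')]
```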

score 38 · Accepted Answer

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

df = spark.read.csv("/home/stp/test1.csv",header=True,sep="|")

print(df.collect())

score 22 · Accepted Answer

Another option is to read the CSV file with Pandas and then import the Pandas DataFrame into Spark.

For example:

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) # if no header
s_df = sql_sc.createDataFrame(pandas_df)

score 18 · Accepted Answer

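Simply splitting on commas will also split commas that are inside fields (e.g. a,b,"1,2,3",c), so it's not recommended. zero323's answer is good if you want to use the DataFrames API, but if you want to stick to base Spark, you can parse CSVs in base Python with the csv module: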

# works for both python 2 and 3
import csv
rdd = sc.textFile("file.csv")
rdd = rdd.mapPartitions(lambda x: csv.reader(x))

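Edit: As @muon mentioned in the comments, this treats the header like any other row, so you'll need to extract it manually. For example, header = rdd.first(); rdd = rdd.filter(lambda x: x != header) (make sure not to modify header before the filter is evaluated). But at this point, you're probably better off using a built-in csv parser.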
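
To see the difference outside Spark, here is a small plain-Python sketch. The two sample lines are made up; the header handling mirrors the filter idea above, just on a list instead of an RDD.

```python
import csv

# Made-up sample: a header row and one row with a quoted field containing commas.
lines = ["name,values,tag", 'a,"1,2,3",c']

# Naive splitting breaks the quoted field apart.
naive = [line.split(",") for line in lines]
print(naive[1])  # ['a', '"1', '2', '3"', 'c']

# csv.reader accepts any iterable of lines and respects the quotes.
parsed = list(csv.reader(lines))
print(parsed[1])  # ['a', '1,2,3', 'c']

# Drop the header by comparing each row against the first one.
header = parsed[0]
data = [row for row in parsed if row != header]
print(data)  # [['a', '1,2,3', 'c']]
```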

score 10 · Accepted Answer

This is in PySpark

path="Your file path with file name"

df=spark.read.format("csv").option("header","true").option("inferSchema","true").load(path)

Then you can check

df.show(5)
df.count()

score 7 · Accepted Answer

If you want to load the csv as a dataframe, you can do the following:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('sampleFile.csv') # this is your csv file

It worked fine for me.

score 6 · Accepted Answer

Now, there's also another option for any general csv file: https://github.com/seahboonsiew/pyspark-csv as follows:

Assume we have the following context

sc = SparkContext
sqlCtx = SQLContext or HiveContext

First, distribute pyspark-csv.py to the executors using SparkContext

import pyspark_csv as pycsv
sc.addPyFile('pyspark_csv.py')

Read the csv data via SparkContext and convert it to a DataFrame

plaintext_rdd = sc.textFile('hdfs://x.x.x.x/blah.csv')
dataframe = pycsv.csvToDataFrame(sqlCtx, plaintext_rdd)

score 6 · Accepted Answer

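This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read the data into Pandas in chunks, it should be more malleable. That means you can parse a much larger file than Pandas can actually handle as a single piece, and pass it to Spark in smaller pieces. (This also answers the comment about why one would want to use Spark if they can load everything into Pandas anyway.)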

from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local','example')  # if using locally
sql_sc = SQLContext(sc)

Spark_Full = sc.emptyRDD()
chunk_100k = pd.read_csv("Your_Data_File.csv", chunksize=100000)
# if you have headers in your csv file:
headers = list(pd.read_csv("Your_Data_File.csv", nrows=0).columns)

for chunky in chunk_100k:
    Spark_Full +=  sc.parallelize(chunky.values.tolist())

YourSparkDataFrame = Spark_Full.toDF(headers)
# if you do not have headers, leave empty instead:
# YourSparkDataFrame = Spark_Full.toDF()
YourSparkDataFrame.show()

score 4 · Accepted Answer

If your csv data happens to not contain newlines in any of the fields, you can load your data with textFile() and parse it

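import csv
import io

def loadRecord(line):
    # Wrap the single line in a file-like object so csv.DictReader can parse it
    buffer = io.StringIO(line)
    reader = csv.DictReader(buffer, fieldnames=["name1", "name2"])
    return next(reader)

# inputFile is the path to the CSV file
records = sc.textFile(inputFile).map(loadRecord)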
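
The "no newlines inside fields" condition matters because a line-oriented reader hands each physical line to the parser separately. The plain-Python sketch below uses a made-up string (raw) to show how a quoted field containing a newline is cut in two when the text is split into lines first, while a CSV parser reading the whole text keeps it as one record.

```python
import csv
import io

# Made-up sample: the second record has a quoted field that spans two lines.
raw = 'id,comment\n1,"first line\nsecond line"\n'

# What a line-by-line reader would hand to the per-line parser.
lines = raw.splitlines()
print(len(lines))  # 3
print(lines[1])    # 1,"first line

# Parsing the whole text at once keeps the multi-line field together.
records = list(csv.reader(io.StringIO(raw, newline="")))
print(len(records))  # 2
print(records[1])    # ['1', 'first line\nsecond line']
```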

score 3 · Accepted Answer

This error can arise if one or more rows in the dataset have fewer than 2 columns.

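I am also new to PySpark and was trying to read a CSV file. The following code worked for me:

In this code I am using a dataset from Kaggle; the link is: https://www.kaggle.com/carrie1/ecommerce-data

1. Without mentioning the schema: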

from pyspark.sql import SparkSession  
scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")
sdfData.show()

Now check the columns: sdfData.columns

The output will be:

['InvoiceNo', 'StockCode','Description','Quantity', 'InvoiceDate', 'CustomerID', 'Country']

Check the datatype of each column:

sdfData.schema
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))

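This gives a data frame in which every column has datatype StringType

2. With schema: If you know the schema or want to change the datatype of any column in the table above, use this (say I have the following columns and want each of them to have a particular data type)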

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", StringType()),
    StructField("CustomerID", DoubleType()),
    StructField("Country", StringType())
])

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL example: Reading CSV file with schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",", schema=schema)

Now check the schema for the datatype of each column:

sdfData.schema

StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))

Edit: We can also use the following line of code, without mentioning the schema explicitly:

sdfData = scSpark.read.csv("data.csv", header=True, inferSchema = True)
sdfData.schema

The output is:

StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))

The output will look like this:

sdfData.show()

+---------+---------+--------------------+--------+--------------+----------+-------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|CustomerID|Country|
+---------+---------+--------------------+--------+--------------+----------+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|      2.55|  17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|      2.75|  17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|      7.65|  17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|      4.25|  17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|      1.85|  17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|      1.85|  17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|      1.69|  13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|      3.75|  13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|      1.65|  13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|      4.25|  13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|      4.95|  13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|      9.95|  13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|      7.95|  13047|
+---------+---------+--------------------+--------+--------------+----------+-------+
only showing top 20 rows

score 3 · Accepted Answer

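When using spark.read.csv, I find that the options escape='"' and multiLine=True give the solution most consistent with the CSV standard and, in my experience, work best with CSV files exported from Google Sheets.

That is,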

#set inferSchema=False to read everything as string
df = spark.read.csv("myData.csv", escape='"', multiLine=True,
     inferSchema=False, header=True)
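
For context on why escape='"' helps: the usual CSV convention (RFC 4180) writes a literal double quote inside a quoted field by doubling it, whereas Spark's CSV reader escapes with a backslash unless told otherwise. Python's csv module follows the doubling convention by default, so it can be used to sketch what the expected result looks like. The sample line is made up.

```python
import csv

# Made-up sample: the middle field contains quotes written as "" (doubled).
line = 'id,"She said ""hello""",end'

# With the default doublequote=True, "" inside a quoted field becomes one ".
fields = next(csv.reader([line]))
print(fields)  # ['id', 'She said "hello"', 'end']

# Plain splitting keeps the raw quoting, which is rarely what you want.
naive = line.split(",")
print(naive[1])  # "She said ""hello"""
```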

score -1 · Accepted Answer

Read your csv file this way:

df= spark.read.format("csv").option("multiline", True).option("quote", "\"").option("escape", "\"").option("header",True).load(df_path)

The Spark version is 3.0.1
