apache-spark - PySpark logging: information printed at the wrong log level

Question

Thanks for your time!

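While debugging my code, I want to create clear summaries of my (large) data and print them to my output, but stop creating and printing those summaries once I am done, to speed things up. It was suggested that I use logging, which I have implemented. It prints text strings as messages to the output as expected, but when printing summaries of dataframes it seems to ignore the log level: the summaries are always created and always printed.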

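Is logging the right tool here, or is there a better way to do this? I could comment out (#) lines of code or use if statements and so on, but it is a huge codebase and I know I will need the same checks in the future as more elements are added, which seems like exactly what logging is meant for.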

from pyspark.sql.functions import col,count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])

print("1")
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
logger.debug(df.show())

print("2")
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(df.show())

print("3")
logger.setLevel(logging.INFO)
logger.debug("3 - INFO - Don't print the message or show the collected data")
logger.debug(df.collect())

print("4")
logger.setLevel(logging.DEBUG)
logger.debug("4 - DEBUG - Print the message and the collected data")
logger.debug(df.collect())

Output:

1
DEBUG:__main__:1 - DEBUG - Print the message and show the table
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
DEBUG:__main__:None
2
+----+----+
|COLA|COLB|
+----+----+
|   1|   2|
|   3|   4|
+----+----+
3
4
DEBUG:__main__:4 - DEBUG - Print the message and the collected data
DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

score 0 · Accepted Answer

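Logging is working as expected. df.show() and df.collect() are Spark actions, and Python evaluates a call's arguments before making the call, so they run even when they are passed to logger.debug. That is also why case 1 logs None: df.show() prints the table itself and returns None.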

If we set the log level to DEBUG, we also see INFO-level logs.
If we set the log level to INFO, we do not see DEBUG-level logs.
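
To make the point about argument evaluation concrete, here is a minimal sketch in plain Python with no Spark involved. The function expensive_summary() is a hypothetical stand-in for df.show() or df.collect(), and the logger name is made up for the example.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("eager_demo")  # illustrative logger name
logger.setLevel(logging.INFO)

calls = []  # records every time the stand-in summary is built


def expensive_summary():
    """Hypothetical stand-in for df.show() or df.collect()."""
    calls.append("ran")
    return "summary"


# Python evaluates the argument first, so expensive_summary() runs here
# even though the DEBUG record itself is discarded at INFO level.
logger.debug(expensive_summary())

summary_ran = len(calls) == 1
debug_enabled = logger.isEnabledFor(logging.DEBUG)
```

Here summary_ran ends up True while debug_enabled is False: the work was done, and only the log record was dropped.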

One workaround is to store the collect()/take(n) result in a variable and then use that variable in the logging calls.

from pyspark.sql.functions import col,count
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

df = spark.createDataFrame([(1,2),(3,4)],["COLA","COLB"])

# Store the results in a variable. Avoid collect() on a huge dataset; use take(n) instead.
res=df.collect()

# Get up to 10 records from df (this overwrites the collect() result above)
res=df.take(10)

print("1")
#1
logger.setLevel(logging.DEBUG)
logger.debug("1 - DEBUG - Print the message and show the table")
#DEBUG:__main__:1 - DEBUG - Print the message and show the table
logger.debug(res)
#DEBUG:__main__:[Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]

print("2")
#2
logger.setLevel(logging.INFO)
logger.debug("2 - INFO - Don't print the message or show the table")
logger.debug(res) #this won't print as loglevel is INFO.
logger.info("result: " + str(res)) #this will get printed out
#INFO:__main__:result: [Row(COLA=1, COLB=2), Row(COLA=3, COLB=4)]
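
Note that storing the result in a variable still runs the Spark action once, whatever the log level. If the aim is to skip the work entirely when DEBUG is off, one option is to guard it with logger.isEnabledFor, which is part of the standard logging module. The sketch below uses a hypothetical build_summary() in place of something like df.take(10), and log_summary() is likewise just an illustrative helper.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("guard_demo")  # illustrative logger name

calls = []  # records every time the stand-in summary is built


def build_summary():
    """Hypothetical stand-in for an expensive call such as df.take(10)."""
    calls.append("ran")
    return [("COLA", 1), ("COLB", 2)]


def log_summary():
    # Only do the expensive work when DEBUG records would actually be emitted.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("summary: %s", build_summary())


logger.setLevel(logging.INFO)
log_summary()  # skipped: build_summary() is never called
calls_at_info = len(calls)

logger.setLevel(logging.DEBUG)
log_summary()  # runs: build_summary() is called once
calls_at_debug = len(calls)
```

Wrapping the check in one helper keeps the if statement in a single place, which may suit a large codebase better than repeating it at every call site.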

Use .take(n) instead of .collect() on large datasets.
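
In the same spirit of keeping debug output cheap, another possible pattern is to pass the logger an object that only does the work when the message is actually formatted. This is a sketch under assumptions, not something from the answer above: Lazy is a hypothetical helper class and build_summary() stands in for a call such as df.take(10). It relies on the standard logging behaviour that a record's arguments are only converted to text when a handler formats the record.

```python
import logging

logger = logging.getLogger("lazy_demo")  # illustrative logger name
logger.addHandler(logging.StreamHandler())
logger.propagate = False  # keep exactly one handler for this demo

calls = []  # records every time the stand-in summary is built


class Lazy:
    """Hypothetical helper: defers a callable until the message is formatted."""

    def __init__(self, func):
        self.func = func

    def __str__(self):
        return str(self.func())


def build_summary():
    """Hypothetical stand-in for something like df.take(10)."""
    calls.append("ran")
    return "rows"


logger.setLevel(logging.INFO)
logger.debug("summary: %s", Lazy(build_summary))  # dropped before formatting
calls_at_info = len(calls)

logger.setLevel(logging.DEBUG)
logger.debug("summary: %s", Lazy(build_summary))  # formatted, so func runs
calls_at_debug = len(calls)
```

One caveat: the callable runs each time the record is formatted, so with several handlers attached it may run more than once.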
