rhadoop - Strange characters in RHDFS output

Question

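The hdfs.write() command in rhdfs creates a file with leading non-Unicode characters. The documentation does not describe the type of file being written.

Steps to reproduce: 1. Open R and initialize rhdfs.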

> ofile = hdfs.file("brian.txt", "w")
> hdfs.write("hi",ofile)
> hdfs.close(ofile)

This creates a file named "brian.txt", which I would expect to contain the single string "hi". Instead, the file starts with extra characters.

> hdfs dfs -cat brian.txt
X
    hi

I don't know what file type is being created, and rhdfs does not show any file type options. This makes the output very difficult to use.

score 3 · Accepted Answer

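If you look at the hdfs.write function in the source code, you will see that it can take raw bytes instead of having R serialize the object for you. So for character data you can do this: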

ofile = hdfs.file("brian.txt", "w")
hdfs.write(charToRaw("hi"), ofile)  # pass raw bytes so R does not serialize the string
hdfs.close(ofile)
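
The stray leading bytes can be reproduced without a cluster. The following is a minimal sketch in base R, assuming (as this answer implies) that hdfs.write passes non-raw objects through R's serialize() before writing them. It compares the serialized bytes with the plain bytes from charToRaw().

```r
# Bytes R produces when it serializes the string (default binary XDR format).
serialized <- serialize("hi", NULL)

# Bytes of the plain text only.
plain <- charToRaw("hi")

# The serialized stream starts with the format marker "X\n" (0x58 0x0a),
# continues with header fields, and ends with the two bytes of "hi".
print(rawToChar(serialized[1:2]))
print(plain)
print(length(serialized) > length(plain))
```

The "X" followed by a newline matches what hdfs dfs -cat shows in the question, which suggests the file holds an R serialization stream rather than plain text.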

score 1 · Accepted Answer

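If you create/write the file directly, rhdfs serializes the object by default, which is why you see extra characters in the file. This does not happen when you copy a text file from the local file system with copyFromLocal.

Serialization is the process of converting a structured object into a byte stream. It serves two main purposes: 1) transmission over a network (interprocess communication), and 2) writing to persistent storage.

You can deserialize the Hadoop object with the following R code: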

hfile = hdfs.file("brian.txt", "r") # read from hdfs
file <- hdfs.read(hfile) 
file <- unserialize(file) # deserialize to remove special characters
hdfs.close(hfile)
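
The round trip can be checked locally before involving HDFS. This is a sketch only: the payload variable is a hypothetical stand-in for the raw vector that hdfs.read() is assumed to return for a file written with hdfs.write("hi", ofile).

```r
# Hypothetical stand-in for the bytes read back from HDFS.
payload <- serialize("hi", NULL)

# unserialize() strips the serialization header and rebuilds the R object.
restored <- unserialize(payload)

print(identical(restored, "hi"))
```

Only R can read such a file back this way; other tools will see the serialization header as leading junk.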

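If you plan to create the file from R but will not read it back through R, a workaround that avoids the special characters is to save the content to a local file and then move that file to HDFS. Here is the R code: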

# Set environment path and load library
Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
library(rhdfs)
hdfs.init()  # Initialize

text <- "Hi, This is a sample text."
SaveToLocalPath <- "/home/manohar/Temp/outfile.txt"
writeLines(text, SaveToLocalPath) # write content to local file
hdfs.put(SaveToLocalPath, "/tmp") # Copy file to hdfs
file.remove(SaveToLocalPath) # Delete from local
