I am using Hadoop to process an XML file, so I wrote the mapper and reducer files in Python.
Suppose the input to be processed is test.xml:
<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
<table>
<columns>
<column name="campaignID" display="Campaign ID"/>
<column name="adGroupID" display="Ad group ID"/>
</columns>
<row campaignID="79057390" adGroupID="3451305670"/>
<row campaignID="79057390" adGroupID="3451305670"/>
</table>
</report>
mapper.py:
import sys
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
            # parse the self-closing <row .../> element (this part was
            # elided in the original post) and emit its attributes
            row = xml.fromstring(line)
            campaignID = row.get('campaignID')
            adGroupID = row.get('adGroupID')
            print '%s\t%s' % (campaignID, adGroupID)
reducer.py:
import sys

if __name__ == '__main__':
    for line in sys.stdin:
        print line.strip()
I ran Hadoop with the following command:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml \
    -output /path/to/output_folder/to/store/file
When I run the above command, Hadoop creates an output file at the output path, and it contains the required data in the correct format, just as reducer.py produces it.
Now what I want to do is this: when I run the above command, instead of storing the output in the text file Hadoop creates by default, I want to save the data to a MySQL database. So I wrote some Python code in reducer.py that writes the data directly to MySQL, and tried running the command with the output path removed, like this:
bin/hadoop jar contrib/streaming/hadoop-streaming-1.0.4.jar \
    -file /path/to/mapper.py -mapper /path/to/mapper.py \
    -file /path/to/reducer.py -reducer /path/to/reducer.py \
    -input /path/to/input_file/test.xml
and I got the following error:
12/11/08 15:20:49 ERROR streaming.StreamJob: Missing required option: output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
.........................
.........................
So, my questions are:

- After all this, how do I save the data to the database once the file has been processed?
- In which file (mapper.py or reducer.py?) should the code that writes data to the database go?
- Which command should be used to run Hadoop so that the data is saved to the database? When I remove the output folder path from the Hadoop command, it shows the error above.

Can anyone help me with these questions?
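For what it's worth, writing to MySQL from the reducer would look something like the sketch below. This is not the asker's actual code: the MySQLdb (mysql-python) driver, the credentials, and the Xml_Data database / PerformaceReport table (taken from the sqoop command further down) are all assumptions.

```python
import sys

def parse_line(line):
    # one mapper output line is "campaignID<TAB>adGroupID"
    campaignID, adGroupID = line.strip().split('\t')
    return campaignID, adGroupID

def save_to_mysql(rows):
    # assumes the MySQLdb package and a local MySQL server; imported
    # lazily so the parsing helper can be used without the driver
    import MySQLdb
    conn = MySQLdb.connect(host='localhost', user='root',
                           passwd='password', db='Xml_Data')
    cur = conn.cursor()
    # one batched INSERT instead of a round trip per row
    cur.executemany(
        "INSERT INTO PerformaceReport (campaignID, adGroupID)"
        " VALUES (%s, %s)", rows)
    conn.commit()
    conn.close()
```

A reducer built this way would end with `save_to_mysql([parse_line(l) for l in sys.stdin if l.strip()])`, so the whole job's output goes to the database in one batch instead of to a text file.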
EDITED

As described above, I created the mapper and reducer files, which read the XML file and, via the hadoop command, produce a text file in some folder.

For example, the folder containing the text file (the result of processing the XML file with the hadoop command) is:

/home/local/user/Hadoop/xml_processing/xml_output/part-00000

The XML file here is 1.3 GB in size, and after processing with Hadoop the text file created is 345 MB.

Now all I want to do is read the text file at the above path and save the data to the MySQL database as fast as possible.

I have tried this with plain Python, but it takes about 350 sec to process the text file and save it to the MySQL database.
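Row-by-row inserts from Python are usually the bottleneck here. Since the reducer output is already a clean tab-separated file, MySQL's bulk loader should be much faster than 350 seconds. A sketch, assuming the same PerformaceReport table and a server that permits LOAD DATA LOCAL INFILE; the helper below is hypothetical, not part of any library:

```python
def build_load_sql(path, table, columns):
    # build a LOAD DATA LOCAL INFILE statement for a tab-separated file;
    # path/table/columns are interpolated without escaping, so only pass
    # trusted values
    return ("LOAD DATA LOCAL INFILE '%s' INTO TABLE %s "
            "FIELDS TERMINATED BY '\\t' (%s)"
            % (path, table, ", ".join(columns)))
```

The statement would then be executed once over the whole part file, e.g. `cursor.execute(build_load_sql('/home/local/user/Hadoop/xml_processing/xml_output/part-00000', 'PerformaceReport', ['campaignID', 'adGroupID']))`, letting MySQL parse and insert the rows itself.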
Now, as nichole suggested, I downloaded Sqoop and unpacked it at the following path:

/home/local/user/sqoop-1.4.2.bin__hadoop-0.20

Then I went into the bin folder, typed ./sqoop, and got the following error:
sh-4.2$ ./sqoop
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
Try 'sqoop help' for usage.
I also tried the following:
./sqoop export --connect jdbc:mysql://localhost/Xml_Data --username root --table PerformaceReport --export-dir /home/local/user/Hadoop/xml_processing/xml_output/part-00000 --input-fields-terminated-by '\t'
Result:
Warning: /usr/lib/hbase does not exist! HBase imports will fail.
Please set $HBASE_HOME to the root of your HBase installation.
Warning: $HADOOP_HOME is deprecated.
12/11/27 11:54:57 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
12/11/27 11:54:57 INFO tool.CodeGenTool: Beginning code generation
12/11/27 11:54:57 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
java.lang.RuntimeException: Could not load db driver class: com.mysql.jdbc.Driver
at org.apache.sqoop.manager.SqlManager.makeConnection(SqlManager.java:636)
at org.apache.sqoop.manager.GenericJdbcManager.getConnection(GenericJdbcManager.java:52)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:525)
at org.apache.sqoop.manager.SqlManager.execute(SqlManager.java:548)
at org.apache.sqoop.manager.SqlManager.getColumnTypesForRawQuery(SqlManager.java:191)
at org.apache.sqoop.manager.SqlManager.getColumnTypes(SqlManager.java:175)
at org.apache.sqoop.manager.ConnManager.getColumnTypes(ConnManager.java:262)
at org.apache.sqoop.orm.ClassWriter.getColumnTypes(ClassWriter.java:1235)
at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1060)
at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:82)
at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:97)
at org.apache.sqoop.Sqoop.run(Sqoop.java:145)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:181)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:220)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:229)
at org.apache.sqoop.Sqoop.main(Sqoop.java:238)
at com.cloudera.sqoop.Sqoop.main(Sqoop.java:57)
Is the above sqoop command useful for reading the text file and saving the data to the database? We have to take the data from the text file and insert it into the database!
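(Editor's note on the stack trace above: "Could not load db driver class: com.mysql.jdbc.Driver" means Sqoop cannot find the MySQL JDBC driver on its classpath. The usual fix is to download MySQL Connector/J and copy its jar into Sqoop's lib directory, then re-run the same sqoop export command; the jar version below is an assumption.)

```shell
# copy the MySQL Connector/J jar into Sqoop's lib/ directory
cp mysql-connector-java-5.1.25-bin.jar \
   /home/local/user/sqoop-1.4.2.bin__hadoop-0.20/lib/
```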