python - Processing XML files with Hadoop using Python

Question

I am using Python and Hadoop to process XML files. My XML files have the following format:

temp.xml

<report>
<report-name name="ALL_TIME_KEYWORDS_PERFORMANCE_REPORT"/>
<date-range date="All Time"/>
  <table>
    <columns>
       <column name="campaignID" display="Campaign ID"/>
       <column name="adGroupID" display="Ad group ID"/>
       <column name="keywordID" display="Keyword ID"/>
       <column name="keyword" display="Keyword"/>
    </columns>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>

What I want to do is process the XML file above and then save the data to an MSSQL database.

mapper.py code

import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row>") != -1:
            intext = True
            buff = cStringIO.StringIO()
            buff.write(line)
        elif line.find("</>") != -1:
            intext = False
            buff.write(line)
            val = buff.getvalue()
            buff.close()
            buff = None
            print val

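Here, all I want to do is get the data from the row tags, print the values of campaignID, adGroupID, keywordID and keyword, and then pass them as input to reducer.py (which contains the code that saves the data to the database).

I have seen some examples, but in those the tags look like <tag> </tag>, whereas in my case I only have <row/>.

However, my code above does not work and prints nothing. Can anyone correct my code and add the Python code needed to get the values from the row tags (I am very, very new to Hadoop), so that I can extend the code later?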

score 0 · Accepted Answer

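Have you considered using XPath? It is a mini-language for navigating an XML tree, and it is easy to use from Python.

http://docs.python.org/2/library/xml.etree.elementtree.html may be useful to you.

You may also want to look at the question "Need Help using XPath in ElementTree".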
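As a small illustration of what XPath-style paths look like in ElementTree, here is a minimal sketch. The sample document is a cut-down version of the report from the question, with values changed (an assumption for illustration) so that the attribute filter has something to distinguish. ElementTree supports only a subset of XPath, so more advanced expressions may not work.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the report in the question. The values are made up
# for illustration so that the filter below selects only one row.
SAMPLE = """<report>
  <table>
    <row campaignID="1" keyword="Content"/>
    <row campaignID="2" keyword="Other"/>
  </table>
</report>"""

root = ET.fromstring(SAMPLE)

# Path relative to the root element: every <row> directly under a <table>.
all_rows = root.findall('table/row')

# './/row' searches at any depth; [@keyword='Content'] keeps only the rows
# whose keyword attribute has exactly that value.
content_rows = root.findall(".//row[@keyword='Content']")

all_ids = [row.get('campaignID') for row in all_rows]
content_ids = [row.get('campaignID') for row in content_rows]
print(all_ids)
print(content_ids)
```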

Here is how I would do it (this is working Python code; I tested it in Python 3.2 and it works with your example XML):

import xml.etree.ElementTree as xml #you had this line in your code. I am not using any tool you  do not have access to in your script

def get_row_attributes(the_xml_as_a_string):
    """
    this function takes xml as a string. 
    It can work with xml that looks like your included example xml.
    This function returns a list of dictionaries. Each dictionary is made up of the attributes of each row. So the result looks like:
     [
          {attribute_name:value_for_first_row,attribute_name:value_for_first_row...},
          {attribute_name:value_for_second_row,attribute_name:value_for_second_row...},
          etc
     ]
    """
    tree = xml.fromstring(the_xml_as_a_string)
    rows = tree.findall('table/row')  # 'table/row' is xpath. it means get all the rows in all the tables
    return [row.attrib for row in rows]

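To use this function, read stdin and build a string from it, then call get_row_attributes(the_xml_as_a_string).

The resulting dictionaries contain the information you asked for (the attributes of each row).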
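A minimal sketch of that step might look like the following. The helper name rows_from_stream is hypothetical, and the function above is repeated so the snippet runs on its own. It assumes the stream holds one complete, well-formed report. An in-memory stream stands in for sys.stdin here so the example can be run directly.

```python
import io
import xml.etree.ElementTree as ET


def get_row_attributes(the_xml_as_a_string):
    """Same function as above, repeated so this snippet is self-contained."""
    tree = ET.fromstring(the_xml_as_a_string)
    return [row.attrib for row in tree.findall('table/row')]


def rows_from_stream(stream):
    """Hypothetical helper: read the whole stream and parse it in one go.

    Assumes the stream contains a single complete report document.
    """
    return get_row_attributes(stream.read())


SAMPLE = """<report>
  <table>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
    <row campaignID="79057390" adGroupID="3451305670" keywordID="3000000" keyword="Content"/>
  </table>
</report>"""

# In the real mapper the stream would be sys.stdin instead of this stand-in.
demo_stream = io.StringIO(SAMPLE)
rows = rows_from_stream(demo_stream)
for row in rows:
    print(row['campaignID'], row['keyword'])
```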

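So now we have:

Read the input from stdin
Gathered all the information for every row

All using perfectly ordinary Python.

The last thing to do is write the data out to your other process. If you need help with that part, please say what format the data should be in and where it should go.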
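Since the question does not say what format reducer.py expects, the following is only a sketch under an assumed format: one tab-separated line per row, with the columns in the order campaignID, adGroupID, keywordID, keyword. It also swaps in a different technique from the function above. Each line is parsed on its own instead of the whole document at once, which may suit Hadoop Streaming better because a mapper receives its input line by line and might see only part of a large file. This relies on every <row .../> element sitting on a single line, as in the sample. The names FIELDS and map_lines are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Assumed output columns, taken from the <column> names in the question.
FIELDS = ('campaignID', 'adGroupID', 'keywordID', 'keyword')


def map_lines(lines):
    """Yield one tab-separated record per self-closing <row .../> line."""
    for line in lines:
        line = line.strip()
        # Only complete single-line row elements are handled; all other
        # lines (<report>, <table>, <column ...>) are skipped.
        if not (line.startswith('<row ') and line.endswith('/>')):
            continue
        row = ET.fromstring(line)
        # Missing attributes become empty strings to keep the column count.
        yield '\t'.join(row.get(name, '') for name in FIELDS)


demo = io.StringIO(
    '<table>\n'
    '  <row campaignID="79057390" adGroupID="3451305670" '
    'keywordID="3000000" keyword="Content"/>\n'
    '</table>\n'
)

# In a real mapper, pass sys.stdin instead of the in-memory stand-in.
records = list(map_lines(demo))
for record in records:
    print(record)
```

On the reducer side, each incoming line could then be split on the tab character to recover the four values before writing them to the database.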
