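I am extracting all the text from an XML document. I want to find the description tag, then walk all of its children and grandchildren (there may be further nested elements) and extract their text.
This is my code, but it does not pick up the text inside grandchild tags: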
    for element in root.find('description'):
        print 'parent: ', element.tag, '|', element.attrib
        try:
            data.write(element.text)
            for all_tags in element.findall('./'):
                print 'child: ', all_tags.tag, '|', all_tags.attrib
                if all_tags.text:
                    data.write('\n')
                    data.write(all_tags.text)
                if all_tags.tail:
                    data.write('\n')
                    data.write(all_tags.tail)
                    data.write('\n')
            data.write('\n')
        except TypeError:
            pass
        except UnicodeEncodeError:
            unicodestr = element.text.encode("utf-8")
            data.write(unicodestr)
            data.write('\n')
The problem is in the for all_tags loop.
Sample input:
<description>
<p num="p-0003">
Protein kinases are involved in the signal transduction pathways linking growth factors, hormones and other cell regulation molecules to cell growth, survival and metabolism under both normal and pathological conditions. One such protein kinase, protein kinase B (also known as Akt), is a serine/threonine kinase that plays a central role in promoting the proliferation and survival of a wide range of cell types, thereby protecting cells from apoptosis (programmed cell death) (Khwaja,
<i>Nature</i>
33-34 (1990)). Three members of the Akt/PKB subfamily of second-messenger regulated serine/threonine protein kinases have been identified and are termed Akt1/PKBα, Akt2/PKBβ, and Akt3/PKBγ. A number of proteins involved in cell proliferation and survival have been described as substrates of Akt in cells. Two examples of such substrates include glycogen synthase kinase-3 (GSK3) and Forkhead transcription factors (FKs). See Brazil and Hemmings,
<i>Trends in Biochemical Sciences</i>
26, 675-664.
</p>
<p num="p-0004">
A number of protein kinases and phosphatases regulate the activity of Akt; For instance, activation of Akt is mediated by phosphatidylinositol 3-kinase (PI3-K), which initiates the binding of second messenger phospholipids to the pleckstrin homology (PH) binding domain of Akt. The binding anchors Akt to plasma membrane and results in phosphorylation and activation of the enzyme. Amplifications of the catalytic subunit of PI3-K, p110α, or mutations in the PI3-K regulatory subunit, p85α, lead to activation of Akt in several types of human cancer. (Vivanco and Sawyers,
<i>Nature Reviews in Cancer</i>
(2002) 2: 489-501.
</p>
<p num="p-0005">
The tumor suppressor, PTEN, is a critical negative regulator of Akt activation by PI3-K. Myers et al.
</p>
</description>
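In this input, the text after <i>Nature</i> is dropped and replaced with the text from the first line. I believe this is because all_tags.tail takes its text from the parent tag rather than from the child and grandchild tags.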
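For reference, here is a minimal sketch of how .text and .tail behave in xml.etree.ElementTree, and one possible way to collect text at any nesting depth with Element.itertext() (available in Python 2.7 and later). It is written in Python 3 syntax, unlike the Python 2 code in the question. SAMPLE is a shortened, made-up stand-in for the input above, the <root> wrapper is an assumption about the real document, and paragraph_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the sample input (assumption: same nesting, less text).
SAMPLE = """<root><description>
<p num="p-0003">Survival (Khwaja, <i>Nature</i> 33-34 (1990)). See Brazil and Hemmings, <i>Trends</i> 26, 675-664.</p>
<p num="p-0004">PTEN is a <b>negative <i>regulator</i></b> of Akt.</p>
</description></root>"""

root = ET.fromstring(SAMPLE)
description = root.find('description')

# .text is the text between an element's start tag and its first child;
# .tail is the text that follows the element's own end tag.
first_p = description.find('p')
first_i = first_p.find('i')
print(repr(first_p.text))  # text before the first <i>
print(repr(first_i.text))  # text inside <i>
print(repr(first_i.tail))  # text after </i>, up to the next tag


def paragraph_texts(desc):
    """Return one whitespace-normalized string per direct child of desc."""
    result = []
    for child in desc:
        # itertext() yields the text and tail of every descendant,
        # at any depth, in document order.
        joined = ''.join(child.itertext())
        result.append(' '.join(joined.split()))
    return result


paragraphs = paragraph_texts(description)
for paragraph in paragraphs:
    print(paragraph)
```

Because itertext() recurses through all descendants, the second paragraph keeps the text of the <i> nested inside <b> without a hand-written loop per level.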