Just use a HTML parser anyway. Python comes with a few included, and the xml.etree.ElementTree
API is easier to get working than a regular expression for even simple <a>
tags with arbitrary attributes:
from xml.etree import ElementTree as ET
texts = []
for linktext in linkslist:
link = ET.fromstring(linktext)
texts.append([link.attrib['href'], link.text])
If you use ' '.join(link.itertext())
you can get the text out of anything nested under the <a>
tag, if you find that some of the links have nested <span>
, <b>
, <i>
or other inline tags to mark up the link text further:
for linktext in linkslist:
link = ET.fromstring(linktext)
texts.append([link.attrib['href'], ' '.join(link.itertext())])
This gives:
>>> from xml.etree import ElementTree as ET
>>> linkslist = ['<a href="needs to be cut out">Foo to BAR</a>', '<a href="this also needs to be cut out">BAR to Foo</a>']
>>> texts = []
>>> for linktext in linkslist:
... link = ET.fromstring(linktext)
... texts.append([link.attrib['href'], ' '.join(link.itertext())])
...
>>> texts
[['needs to be cut out', 'Foo to BAR'], ['this also needs to be cut out', 'BAR to Foo']]