(I am aware that regex is not the recommended way to deal with HTML, but this is my assignment.)
I need a regex in Java that captures HTML tags and their attributes. I am trying to achieve this with a single regex using groups. I expected this regex to work:
<(?!!)(?!/)\s*(\w+)(?:\s*(\S+)=['"]{1}[^>]*?['"]{1})*\s*>
< the tag starts with <
(?!!) I don't want comments
(?!/) I don't want closing tags
\s* any number of white spaces
(\w+) the tag
(?: do not capture the following group
\s* any number of white spaces before the first attribute
(\S+) capture the attribute's name
=['"]{1}[^>]*?['"]{1} the ="bottm" or ='bottm' etc.
)* close the non-capturing group; it can occur zero or more times
\s* any white spaces before the closing of the tag
> close the tag
I expected the result for a tag like:
<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"
but the result is:
group(1) = "div"
group(2) = "class"
It seems that it is not possible to capture a group multiple times with (...)*, since only the last repetition is kept. Is this correct?
For now I use a regex like:
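To make the behaviour reproducible, here is a minimal sketch that runs the regex above against the example tag. The class and method names (RepeatedGroupDemo, attributeGroup) are only illustrative. It shows that the pattern has exactly two capturing groups no matter how often (...)* repeats, and that group(2) holds only the last repetition.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // The regex from above, unchanged apart from Java string escaping.
    static final Pattern TAG = Pattern.compile(
        "<(?!!)(?!/)\\s*(\\w+)(?:\\s*(\\S+)=['\"]{1}[^>]*?['\"]{1})*\\s*>");

    // Returns what group(2) holds after a match, or null if nothing matches.
    static String attributeGroup(String tag) {
        Matcher m = TAG.matcher(tag);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String tag = "<div id=\"qwerty\" class='someClass' >";
        // The number of groups is fixed by the pattern, not by the input.
        System.out.println(TAG.matcher(tag).groupCount()); // prints 2
        // Each repetition overwrites group(2), so only the last one survives.
        System.out.println(attributeGroup(tag)); // prints class
    }
}
```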
<(?!!)(?!/)\s*(\w+) (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (?:\s*(\S+)=['"]{1}[^>]*?['"]{1}){0,1} (...){0,1} (...){0,1} ... \s*>
I repeat the capturing group for the attribute multiple times and get results like:
<div id="qwerty" class='someClass' >
group(1) = "div"
group(2) = "id"
group(3) = "class"
group(4) = null
group(5) = null
group(6) = null
...
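For reference, a runnable sketch of this workaround. Instead of typing the attribute group out by hand, it appends the same optional group a fixed number of times. MAX_ATTRS, buildPattern and groups are names I made up for the example, and I wrote {0,1} as ? and dropped the redundant {1}, which should be equivalent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FixedSlotsDemo {
    static final int MAX_ATTRS = 5; // hypothetical cap on attributes per tag

    // Builds the tag regex with one optional capturing slot per attribute.
    static Pattern buildPattern(int slots) {
        StringBuilder sb = new StringBuilder("<(?!!)(?!/)\\s*(\\w+)");
        for (int i = 0; i < slots; i++) {
            sb.append("(?:\\s*(\\S+)=['\"][^>]*?['\"])?");
        }
        sb.append("\\s*>");
        return Pattern.compile(sb.toString());
    }

    // Returns group(1)..group(n) of the first match; unused slots are null.
    static List<String> groups(String tag) {
        List<String> result = new ArrayList<>();
        Matcher m = buildPattern(MAX_ATTRS).matcher(tag);
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                result.add(m.group(g));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(groups("<div id=\"qwerty\" class='someClass' >"));
        // prints [div, id, class, null, null, null]
    }
}
```

The obvious drawback is the cap: attribute names beyond MAX_ATTRS are not captured, and every unused slot comes back as null.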
What other approaches can I use? (I can use multiple regexes, but it is preferred to do it with just one)
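One direction I can think of with two regexes, as a rough sketch only: the first regex captures the tag name and the raw attribute text, and the second one is applied to that text in a find() loop, so each attribute gets its own match. TwoPassDemo, ATTR and attributeNames are made-up names, and the sketch has the same limits as my original regex (for example, a > inside an attribute value breaks it).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassDemo {
    // Group 1: tag name; group 2: everything between the name and the closing '>'.
    static final Pattern TAG = Pattern.compile("<(?!!)(?!/)\\s*(\\w+)([^>]*)>");
    // Group 1: attribute name, followed by a quoted value.
    static final Pattern ATTR = Pattern.compile("(\\S+)=['\"][^>]*?['\"]");

    // Returns the attribute names of the first opening tag found in the input.
    static List<String> attributeNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher tag = TAG.matcher(html);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group(2));
            // Each find() call yields one attribute, so nothing is overwritten.
            while (attr.find()) {
                names.add(attr.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(attributeNames("<div id=\"qwerty\" class='someClass' >"));
        // prints [id, class]
    }
}
```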