I'm working on a tool that converts Python source code into a nicely formatted HTML file. It reads a Python file line by line, inspects each line to determine what it contains, and then adds the appropriate <span> tags with colors, line breaks, and so on.

I have the general structure of the program in place; now I'm writing the functions that take a string and return an HTML-enriched string.

I'm stuck on parsing strings that have quotes in them, e.g.:

x = 'hello there'  
if x == 'example "quotes" inside quotes' and y == 'another example':    

My approach so far has been to enumerate the string to collect the indices of the single quotes into a list, and then use two while loops to put the right HTML tags in the right places. It seemed to work fine when there was a single quote in the string, but all hell broke loose when I introduced two quotes on a line, quotes inside quotes, or finally a string made up of '\''.

It seems this route is a dead end. I'm now thinking of turning to .split(), shlex, or re and breaking down the string into a list and trying to work with that.
I would really appreciate tips, pointers, and any advice.

Edit: To make it clearer, I need to put HTML tags in the right places in a string. Working with string indices didn't get me very far with more complex strings.

3 Answers

"Colorize Python source using the built-in tokenizer" is an example of this kind of code (it uses cgi.escape). See if it fits your needs!

Answered 2012-09-17T22:26:47.863

You could use tokenize.generate_tokens:

import tokenize
import token
import io

text = '''
x = 'hello there'  
if x == 'example "quotes" inside quotes' and y == 'another example': pass
'''


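# On Python 3, generate_tokens needs a readline callable that returns str,
# so wrap the text in io.StringIO rather than io.BytesIO.
tokens = tokenize.generate_tokens(io.StringIO(text).readline)
for toknum, tokval, (srow, scol), (erow, ecol), line in tokens:
    tokname = token.tok_name[toknum]
    # Print a tuple so each token string is shown as its repr.
    print((tokname, tokval))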

yields

('NL', '\n')
('NAME', 'x')
('OP', '=')
('STRING', "'hello there'")
('NEWLINE', '\n')
('NAME', 'if')
('NAME', 'x')
('OP', '==')
('STRING', '\'example "quotes" inside quotes\'')
('NAME', 'and')
('NAME', 'y')
('OP', '==')
('STRING', "'another example'")
('OP', ':')
('NAME', 'pass')
('NEWLINE', '\n')
('ENDMARKER', '')

From here, you can output appropriate HTML based on the type (tokname) of each token.
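
As a rough illustration of that last step, here is a minimal sketch for Python 3 that wraps some token types in <span> tags. The function names highlight and css_class and the CSS class names (py-string, py-keyword, and so on) are made up for this example, not part of any library. It copies the text between tokens straight from the source so the original spacing survives, and it escapes everything with html.escape.

```python
import html
import io
import keyword
import token
import tokenize

# Hypothetical CSS class names, chosen only for this sketch.
CSS_CLASSES = {
    token.STRING: "py-string",
    token.NUMBER: "py-number",
    token.OP: "py-op",
    tokenize.COMMENT: "py-comment",
}


def css_class(tok):
    """Return a CSS class for the token, or None to leave it unstyled."""
    if tok.type == token.NAME and keyword.iskeyword(tok.string):
        return "py-keyword"
    return CSS_CLASSES.get(tok.type)


def highlight(source):
    """Return source as HTML with <span> tags around recognised tokens."""
    # offsets[i] is the index in source where line i + 1 starts.
    offsets = [0]
    for line in io.StringIO(source):
        offsets.append(offsets[-1] + len(line))

    pieces = []
    pos = 0  # index in source up to which text has been emitted
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        start = offsets[tok.start[0] - 1] + tok.start[1]
        end = offsets[tok.end[0] - 1] + tok.end[1]
        # Copy whatever sits between the previous token and this one.
        pieces.append(html.escape(source[pos:start], quote=False))
        text = html.escape(source[start:end], quote=False)
        css = css_class(tok)
        if css and text:
            pieces.append('<span class="{}">{}</span>'.format(css, text))
        else:
            pieces.append(text)
        pos = max(pos, end)
    pieces.append(html.escape(source[pos:], quote=False))
    return "".join(pieces)


print(highlight("x = 'a \"b\" c'  # note\n"))
```

Because the tokenizer reports each string literal as one STRING token, quotes inside quotes and escaped quotes such as '\'' need no special handling here.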

Answered 2012-09-17T22:41:43.777

Something like cgi.escape is probably what you want. There are also tools like BeautifulSoup and Pygments that do something very similar to what you're making; you may want to leverage them.
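
Note that cgi.escape was removed in Python 3.8; html.escape is the standard-library replacement. A small sketch of escaping one line of source before wrapping it in tags (the sample line is invented for illustration):

```python
import html

# A made-up line of Python source containing HTML-special characters.
line = 'if x < 10 and name == "a&b":'

# quote=False leaves quote characters alone, which is fine for text that
# goes between tags rather than inside an attribute value.
escaped = html.escape(line, quote=False)
print(escaped)  # if x &lt; 10 and name == "a&amp;b":
```

Escaping should happen before the <span> tags are added, otherwise the tags themselves would be escaped too.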

Answered 2012-09-17T22:30:37.323