Overview
You need to wrap a bytestream and escape specific values. Also, the other way around is required: unescape control-codes and get the raw payload. You are working with sockets. The socket-commands uses string-parameters. In python, every string is basically a wrapper around a char*
-array.
Naive approach
Its a string and we want to replace specific values with other ones. So what is the simplest way to achieve this?
def unstuff(self, s):
return s.replace('\xFE\xDC', '\xFC').replace('\xFE\xDD', '\xFE').replace('\xFE\xDE', '\xFE')
def stuff(self, s):
return s.replace('\xFC', '\xFE\xDC').replace('\xFD', '\xFE\xDD').replace('\xFE', '\xFE\xDE')
Seems to be bad. With every replace-call, a new string-copy will be created.
Iterator
A very pythonic approach is to define an iterator for this specific problem: define the iterator to transform the input-data into the desired output.
def unstuff(data):
i = iter(data)
dic = {'\xDC' : '\xFC', '\xDD' : '\xFD', '\xFE' : '\xDE'}
while True:
d = i.next() # throws StopIteration on the end
if d == '\xFE':
d2 = i.next()
if d2 in dic:
yield dic[d2]
else:
yield '\xFE'
yield d2
else:
yield d
def stuff(data):
i = iter(data)
dic = { '\xFC' : '\xDC', '\xFD' : '\xDD', '\xFE' : '\xDE' }
while True:
d = i.next() # throws StopIteration on the end
if d in dic:
yield '\xFE'
yield dic[d]
else:
yield d
def main():
s = 'hello\xFE\xDCWorld'
unstuffed = "".join(unstuff(s))
stuffed = "".join(stuff(unstuffed))
print s, unstuffed, stuffed
# also possible
for c in unstuff(s):
print ord(c)
if __name__ == '__main__':
main()
stuff()
and unstuff()
need something iterable (list, string, ...) and return an iterator-object. If you want to print
the result or pass it into socket.send
, you need to convert it back to a string (as shown with "".join()
). Every unexpected data is handled somehow: 0xFE 0x__
will be returned verbatim, if it does not match any pattern.
RegExp
Another way would be to use regular expressions. Its a big topic and a source of trouble sometimes, but we can keep it simple:
import re
s = 'hello\xFE\xDCWorld' # our test-string
# read: FE DC or FE DD or FE DE
unstuff = re.compile('\xFE\xDC|\xFE\xDD|\xFE\xDE')
# read:
# - use this pattern to match against the string
# - replace what you have found (m.groups(0), whole match) with
# char(ord(match[1])^0x20)
unstuffed = unstuff.sub(lambda m: chr(ord(m.group(0)[1])^0x20), s)
# same thing, other way around
stuff = re.compile('\xFC|\xFD|\xFE')
stuffed = stuff.sub(lambda m: '\xFE' + chr(ord(m.group(0))^0x20), unstuffed)
print s, unstuffed, stuffed
As said, you must create the new string somewhere to be able to use it with sockets. At least, this approach do not create unnecessary copies of the string like s.replace(..).replace(..).replace(..)
would. You should keep the patterns stuff
and unstuff
somewhere around as building these objects is relatively expensive.
native C-Function
If some things are going to slow in python, we might want to use cpython and implement it as oure C-code. Basically, I do a first run, count how many bytes I nerd, allocate a new string and do the second run. I'm not very used to python-c-extensions and so I do not want to share this code. It just seems to work, see the next chapter
Comparison
One of the most important rules of optimization: compare! The basic setup for every test:
generate random binary data as a string
while less_than_a_second:
unstuff(stuff(random_data))
count += 1
return time_needed / count
I know, the setup isn't optimal. But we should get some usable result:
What do we see? Native is the fastest way to go, but only for very small strings. This is probably because of the python-interpreter: only one function-call is needed instead of three. But microseconds is fast enough the most of the times. After ~500 bytes, the timings are nearly the same with the naive approach. There must be some deep magic happening down there in the implementation. Iterators and RegExp are unacceptable compared to the effort.
To sum things up: use the naive approach. Its hard to get something better. Also: if you simply guess about timings, you will be almost always wrong.