python-2.7 - Removing Non Unicode characters from a file

Question

I know this is a repeated question, but I have tried all of the solutions I could find so far. Can anyone please help me get rid of characters like \xc3\xa2\xc2\x84\xc2\xa2 in a file?

The file content which I am trying to clean currently is: b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""

I have tried using re.sub('[^\x00-\x7F]+',' ',whatevertext) but can't seem to get anywhere. I suspect that \ here is not being treated as a special character.

score 1 · Accepted Answer

You can do it like this:

>>> f = open("test.txt","r")
>>> whatevertext = f.read()
>>> print whatevertext
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving']"""

>>> import re
>>> result = re.sub('\\\\x[a-f0-9]{2}','',whatevertext)
>>> print result
b'Roasted Onion Dip',"b""['2 pounds large yellow onions, thinly sliced', '3 large shallots, thinly sliced', '4 sprigs thyme', '1/4 cup olive oil', 'Kosher salt and freshly ground black pepper', '1 cup white wine', '2 tablespoons champagne vinegar', '2 cups sour cream', '1/2 cup chopped fresh chives', '1/4 cup plain Greek yogurt', 'Everything seasoning and thyme to garnish', 'Cape Cod Waves Potato Chips for serving']"""


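In this regular expression, '\\\\x' matches a literal backslash followed by x: each backslash is escaped with another backslash. After the x, we know there can be the digits 0-9 or the letters a-f. The file holds the text \xc3 as four ordinary ASCII characters (a backslash, an x and two hex digits), not as a single non-ASCII byte, which is why the pattern [^\x00-\x7F]+ from the question finds nothing to replace.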
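
As a minimal sketch, the same idea can be wrapped in a small function. It assumes the file contains literal backslash-x sequences rather than real non-ASCII bytes. The names `strip_hex_escapes` and `HEX_ESCAPE` are hypothetical, and the pattern is limited to exactly two hex digits so that ordinary letters following an escape are not consumed:

```python
import re

# Matches a literal backslash, an "x", and exactly two hex digits, e.g. \xc3.
# In a raw string, r'\\' is the regex for one literal backslash.
HEX_ESCAPE = re.compile(r'\\x[0-9a-fA-F]{2}')


def strip_hex_escapes(text):
    """Remove literal \\xNN sequences from text (hypothetical helper)."""
    return HEX_ESCAPE.sub('', text)


# The sample contains literal backslashes, so it is already pure ASCII.
sample = r"Cape Cod Waves\xc3\xa2\xc2\x84\xc2\xa2 Potato Chips for serving"

# The non-ASCII pattern from the question has nothing to replace here.
unchanged = re.sub(r'[^\x00-\x7F]+', ' ', sample)

cleaned = strip_hex_escapes(sample)
print(cleaned)
```

With an open-ended `+` quantifier instead of `{2}`, a string such as `a\xa2bed` would lose the trailing `bed` as well, because those letters are also valid hex digits.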
