python - Python inserts unwanted characters

Question

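I'm trying to use Python to generate a set of HTML tables containing values pulled from a CSV. The script runs fine, but it adds odd "¬†" characters wherever a value is pulled in.

Here is the code I use to read the CSV data: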

import csv
import fileinput
import re

out=open("audiencestats.csv","rU")
data=csv.reader(out)
values =[row for row in data]
metrics = values.pop(0) 
out.close()

This defines a function that builds an HTML table:

def maketable(leftmetric, rightmetric, leftvalue, rightvalue):
  template = '''
  <table width="99%%" border="1"> 
   <tbody>
    <tr>
    <td align="center" valign="middle">
    <h3>%s</h3>
    </td>
    <td align="center" valign="middle">
    <h3>%s</h3>
    </td>
    </tr>
    <tr>
    <td align="center" valign="middle"> %s</td>
    <td align="center" valign="middle"> %s</td>
    </tr>
    </tbody>
  </table>
  '''
  file.write(template % (leftmetric, rightmetric, leftvalue, rightvalue))

The tables are then written to a text file:

for i in values:
  filename = "%s.txt" % i[0]
  file = open(filename , 'w')
  file.write(header)
  maketable(metrics[1],metrics[2],i[1],i[2])
  maketable(metrics[3],metrics[4],i[3],i[4])
  maketable(metrics[5],metrics[6],i[5],i[6])
  maketable(metrics[7],metrics[8],i[7],i[8])
  maketable(metrics[9],metrics[10],i[9],i[10])
  maketable(metrics[11],metrics[12],i[11],i[12])
  file.write(header2)
  print makesocial(i[13],i[14],i[15])
  file.close()

I tried adding the re.sub below to the for loop, but the crosses are still there.

for line in fileinput.input(inplace=1):
    line = re.sub('¬†','', line.rstrip())
    print(line)

Am I missing something? Has my computer found religion?

A sample of the output is copied below:

<h1>Audience</h1>
  <table width="99%" border="1"> 
   <tbody>
    <tr>
    <td align="center" valign="middle">
    <h3>UVs (000)</h3>
    </td>
    <td align="center" valign="middle">
    <h3>PVs (000)</h3>
    </td>
    </tr>
    <tr>
    <td align="center" valign="middle">¬†580.705</td>
    <td align="center" valign="middle">¬†1003</td>
    </tr>
    </tbody>
  </table>

score 0 · Accepted Answer

Answered 2013-06-25T21:21:37.297

There's nothing wrong with your data; it's pure ASCII. The problem is in your source code.

When I click the Edit button to copy your actual source, rather than the formatted source, I find non-breaking space (U+00A0) characters in the middle of the template string literal.

Assuming your editor and the browser you copied from and pasted to are doing things right, that means that your actual UTF-8 source has '\xc2\xa0' sequences.

Since you're putting non-ASCII characters into a str/bytes literal (which, as I explained in the other answer, is always a bad idea), this means your strings end up with '\xc2\xa0' sequences.

Somewhere between there and your screen, there's an additional encoding problem, and this is getting garbled into '\xc2\xac\xe2\x80\xa0' sequences, which, when interpreted as UTF-8, show up as u'¬†'.
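The chain described above can be reproduced in a few lines. This is only a sketch written for Python 3, and it rests on one assumption the answer does not make: that the extra garbling comes from some tool reading the UTF-8 bytes as Mac OS Roman (the `mac_roman` codec). Other paths could give the same result.

```python
# Reproduce the garbling: NBSP -> UTF-8 bytes -> misread as Mac OS Roman.
# Assumption: the wrong codec is mac_roman; the answer does not name it.
nbsp = "\u00a0"                            # non-breaking space, U+00A0
utf8_bytes = nbsp.encode("utf-8")          # the two bytes stored in the source file
garbled = utf8_bytes.decode("mac_roman")   # what a Mac Roman reader would display
garbled_utf8 = garbled.encode("utf-8")     # the garbled text re-saved as UTF-8

print(repr(utf8_bytes))      # the '\xc2\xa0' sequence from the source
print(ascii(garbled))        # NOT SIGN (U+00AC) followed by DAGGER (U+2020)
print(repr(garbled_utf8))    # the '\xc2\xac\xe2\x80\xa0' sequence
```

If this is indeed the path, it would also explain why the re.sub in the question did nothing: the generated file holds '\xc2\xa0', not the two characters seen on screen.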

We could try to track down where that additional problem is coming from, but it doesn't matter too much.

The immediate fix is to replace all the non-breaking spaces in your source with plain ASCII spaces.
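A small helper can locate the stray characters before you replace them. The sketch below is for Python 3; the names `find_nbsp` and `replace_nbsp` are made up for illustration and are not part of the original script.

```python
NBSP = "\u00a0"  # non-breaking space


def find_nbsp(text):
    """Return 1-based (line, column) pairs for every NBSP in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == NBSP:
                hits.append((lineno, col))
    return hits


def replace_nbsp(text):
    """Return text with every NBSP swapped for a plain ASCII space."""
    return text.replace(NBSP, " ")


# Hypothetical sample resembling one line of the template string.
sample = "template = '''\n<td>\u00a0%s</td>\n'''\n"
print(find_nbsp(sample))
print(replace_nbsp(sample) == sample.replace("\u00a0", " "))
```

To apply it to a real file, read the source with `open(path, encoding="utf-8")`, pass the text through `replace_nbsp`, and write it back.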

Going beyond that, you need to figure out what you were using that generated these non-breaking spaces. Often, this is a sign of editing source code in word processors rather than text editors; if so, stop doing that.

If you don't actually have any intentionally non-ASCII source code, using # coding=ascii instead of # coding=utf-8 at the top of your file is a great way to catch bugs like this. (You can still process UTF-8 values; the coding declaration only describes the encoding of the source file itself.)
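To see why the stricter declaration catches the problem, the sketch below imitates the check outside the interpreter: it reads the declared encoding with `tokenize.detect_encoding` and then tries to decode the whole file with it. It is an approximation written for Python 3, not the interpreter's own code path, and `check_source` is a hypothetical helper name.

```python
import io
import tokenize


def check_source(source_bytes):
    """Decode source bytes with their declared encoding.

    Returns None on success, or a short error message on failure.
    """
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    try:
        source_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        return "%s: cannot decode byte at offset %d" % (encoding, exc.start)
    return None


# Hypothetical sources: same template line, with and without a stray NBSP.
clean = b"# coding=ascii\ntemplate = '<td> %s</td>'\n"
dirty = b"# coding=ascii\ntemplate = '<td>\xc2\xa0%s</td>'\n"
dirty_utf8 = b"# coding=utf-8\ntemplate = '<td>\xc2\xa0%s</td>'\n"

print(check_source(clean))       # None: pure ASCII decodes fine
print(check_source(dirty))       # an error message: the NBSP bytes are rejected
print(check_source(dirty_utf8))  # None: UTF-8 silently accepts the NBSP
```

The last case is the point: with a UTF-8 declaration the stray character passes unnoticed, while an ASCII declaration turns it into an immediate error.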

score -1 · Accepted Answer

Try this:

line = re.sub(r'(?u)¬†','', line.rstrip())

The regular expression will then treat your string as Unicode.
