This is an encoding problem, not a problem with the HTML entities. When you copy data from HTML into a text box, the browser isn't pasting in an entity like &ndash; but the actual character itself. It looks like the character you are getting is encoded in Windows-1252 (sometimes mistakenly referred to as ISO-8859-1). Since the database is expecting UTF-8, it can't handle this character.
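To see why the database rejects the character, here is a small Python sketch (assuming the character in question is an en dash, U+2013) that compares its bytes in the two encodings. The helper name decodes_as_utf8 is just something I made up for illustration.

```python
# The en dash (U+2013) as it is represented under each encoding.
EN_DASH = "\u2013"

utf8_bytes = EN_DASH.encode("utf-8")      # three bytes in UTF-8
cp1252_bytes = EN_DASH.encode("cp1252")   # a single byte in Windows-1252


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_bytes, decodes_as_utf8(utf8_bytes))      # b'\xe2\x80\x93' True
print(cp1252_bytes, decodes_as_utf8(cp1252_bytes))  # b'\x96' False
```

The lone 0x96 byte is not a valid UTF-8 sequence, which is the kind of input a database expecting UTF-8 will refuse or mangle.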
There are a few possible reasons this might be happening. You didn't list what browser, language, web framework, or database you're using, so I'm going to offer a few suggestions, and hopefully one of them works. In general, it is best to use UTF-8 for your encoding at every stage; but if that's not possible, you either need to use a consistent encoding throughout all of the levels, or you need to convert.
Since your database is using UTF-8, I'll assume that's the encoding that you want to use. One thing to check is whether your pages are being served as UTF-8. Check the headers on your HTTP response; there should be a Content-Type: text/html; charset=utf-8 header. If that header is wrong, missing, or missing the charset=utf-8 part, then the browser may choose the wrong charset. One more thing that's good to do is add a <meta charset=utf-8> tag in your <head>; while this isn't necessary if you have the charset sent as part of the HTTP headers, it can help select the correct charset if the headers aren't present, or if the document is loaded from a file: URL or the like, which doesn't have headers available.
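If you want to check the header programmatically rather than by eye, something like the following sketch should work. It uses Python's standard email.message.Message to parse a Content-Type value; the function name charset_from_content_type is my own, and you would feed it whatever header value your server actually returns.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or None if the header has no
    charset parameter (in which case the browser has to guess).
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))  # utf-8
print(charset_from_content_type("text/html"))                 # None
```

A result of None, or anything other than utf-8, would suggest the server configuration is worth a look.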
While the browser should use the character set of the document when submitting the form, you can ensure that it submits using the correct charset by using the accept-charset attribute on the form: <form accept-charset=utf-8>. This will ensure that even if the page has no charset set in the headers, forms will submit data as UTF-8.
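To illustrate what the form's charset changes on the wire, here is a rough Python sketch of how the same character (again assuming an en dash) is percent-encoded in a form submission under each encoding, and what a server that assumes UTF-8 gets back.

```python
from urllib.parse import quote, unquote

EN_DASH = "\u2013"

# What ends up in the submitted form data under each encoding.
as_utf8 = quote(EN_DASH, encoding="utf-8")
as_cp1252 = quote(EN_DASH, encoding="cp1252")

print(as_utf8)    # %E2%80%93
print(as_cp1252)  # %96

# A server that decodes as UTF-8 recovers the first form correctly...
print(unquote(as_utf8, encoding="utf-8") == EN_DASH)  # True

# ...but the Windows-1252 form is not valid UTF-8, so with the default
# errors="replace" it decodes to U+FFFD, the replacement character.
print(unquote(as_cp1252, encoding="utf-8") == "\ufffd")  # True
```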
Finally, even if all of that is correct, IE 5 through 8 will sometimes submit data in a different encoding than what the page is sent in, if the user has changed their encoding settings. To force it to send UTF-8 data, you can use a hidden form field that includes a character that cannot be encoded in a legacy encoding like Windows-1252. Some versions of Ruby on Rails famously used a snowman (☃) for this purpose, though it was later changed to a checkmark (✓) to be less puzzling. You can add a similar element to your form to force IE to use UTF-8: <input name="_utf8" type="hidden" value="✓">.
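The hidden field's main job is to push the browser into UTF-8, but if you want, the server can also use it as a sanity check. The sketch below is only one way this might look: the function decode_form, the default field name _utf8, and the Windows-1252 fallback are all my assumptions, not something any particular framework does for you.

```python
from urllib.parse import parse_qs

CHECKMARK = "\u2713"


def decode_form(body: str, sentinel_field: str = "_utf8",
                fallback: str = "cp1252") -> dict:
    """Decode a URL-encoded form body, using a hidden sentinel field.

    If the sentinel decodes to a checkmark under UTF-8, the body is
    treated as UTF-8. Otherwise the body is re-decoded with the
    (assumed) legacy fallback encoding.
    """
    fields = parse_qs(body, encoding="utf-8", errors="replace")
    if fields.get(sentinel_field) == [CHECKMARK]:
        return fields
    return parse_qs(body, encoding=fallback, errors="replace")


# A UTF-8 submission: the checkmark and the en dash are both UTF-8.
print(decode_form("_utf8=%E2%9C%93&title=a+%E2%80%93+b")["title"])

# A legacy submission with no valid sentinel: 0x96 is read as Windows-1252.
print(decode_form("title=a+%96+b")["title"])
```

Both calls should yield the same title, with a real en dash between a and b.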
If the above suggestions don't work, please let us know what browser, programming language, web framework, and database you are using, and try to provide a short, self-contained piece of sample code that demonstrates the problem.