
I am trying to tokenize French words, but when I do, words containing accented characters such as "ê" come back as \xe escape sequences. The following is the code that I implemented:

from nltk.tokenize import WhitespaceTokenizer
from nltk.tokenize import RegexpTokenizer

data = "Vous êtes au volant d'une voiture et vous roulez à vitesse"
# Other tokenizers I tried:
#wst = WhitespaceTokenizer()
#tokenizer = RegexpTokenizer('\s+', gaps=True)
tokens = WhitespaceTokenizer().tokenize(data)
print tokens

Output I got:

['Vous', '\xeates', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', '\xe0', 'vitesse']

Desired output:

['Vous', 'êtes', 'au', 'volant', "d'une", 'voiture', 'et', 'vous', 'roulez', 'à', 'vitesse']

3 Answers


In Python 2, to put UTF-8 text in your source code, you need to start the file with a # -*- coding: <encoding name> -*- declaration whenever you use anything beyond ASCII. You also need to prefix Unicode string literals with u:

# -*- coding: utf-8 -*-

import nltk
...

data = u"Vous êtes au volant d'une voiture et vous roulez à grande vitesse"
print WhitespaceTokenizer().tokenize(data)
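
Note that printing the list itself still shows Python 2's repr escapes (u'\xeates') even though the tokens are correct; print the tokens directly to see the glyphs, assuming your terminal can render them:

tokens = WhitespaceTokenizer().tokenize(data)
print tokens             # [u'Vous', u'\xeates', u'au', ...]
print u' '.join(tokens)  # Vous êtes au volant d'une voiture ...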

When you're not writing data in your Python code but reading it from a file, you must make sure that it's properly decoded by Python. The codecs module helps here:

import codecs

# Decode the file's bytes to unicode objects as they are read
with codecs.open('fichier.txt', encoding='utf-8') as f:
    data = f.read()

This is good practice because if there is an encoding error, you will know about it right away: it won't bite you later on, e.g. after processing your data. This is also the only approach that works in Python 3, where codecs.open becomes the built-in open and decoding is always done right away. More generally, avoid Python 2's str type like the plague and always stick with Unicode strings to make sure encoding is handled properly.
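
For reference, here is the Python 3 equivalent (a minimal sketch, reusing the hypothetical fichier.txt from above):

# Python 3: the built-in open decodes for you
with open('fichier.txt', encoding='utf-8') as f:
    data = f.read()  # data is a str, which is always Unicode in Python 3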


Bon courage !

answered 2013-09-02T08:31:17.490

Take a look at the section "3.3 Text Processing with Unicode" in Chapter 3 of the NLTK book.

Make sure that your string is prefixed with a u and you should be OK. Also note from that chapter, as @tripleee suggested:

There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system.
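
As a quick check that you do have the correct encoding (a minimal sketch; Latin-1 is an assumption here, but it matches the \xea and \xe0 escapes in the question's output):

# '\xea' is 'ê' in Latin-1; decoding turns the byte string into unicode
broken = '\xeates'
print broken.decode('latin-1')  # êtes, if your terminal can render it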

answered 2013-09-01T15:07:37.527

You don't really need the WhitespaceTokenizer for French if it's a simple sentence where tokens are naturally delimited by spaces; plain str.split() does the same job. If not, nltk.tokenize.word_tokenize() would serve you better.

See How to print UTF-8 encoded text to the console in Python < 3?

# -*- coding: utf-8 -*-

# Work around Python 2's default ASCII codec (see the linked question)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

sentence = "Vous êtes au volant d'une voiture et vous roulez à grande $3.88 vitesse"

# Plain whitespace splitting
print sentence.split()

# Treebank-style tokenization
from nltk.tokenize import word_tokenize
print word_tokenize(sentence)

# Split runs of word characters and runs of punctuation
from nltk.tokenize import wordpunct_tokenize
print wordpunct_tokenize(sentence)
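
The two tokenizers differ on tokens like $3.88 and d'une: wordpunct_tokenize is essentially a regexp tokenizer over \w+|[^\w\s]+, so it always splits punctuation off into separate tokens (a quick sketch):

from nltk.tokenize import wordpunct_tokenize

# Every run of punctuation becomes its own token
print wordpunct_tokenize("d'une $3.88")
# ['d', "'", 'une', '$', '3', '.', '88']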
answered 2013-09-10T13:16:27.203