How to get rid of punctuation using NLTK tokenizer?

I'm just starting to use NLTK and I don't quite understand how to get a list of words from text. If I use nltk.word_tokenize(), I get a list of words and punctuation. I need only the words instead. How can I get rid of punctuation? Also word_tokenize doesn't work with multiple sentences: dots are added to the last word.

Answers


Take a look at the other tokenizing options that nltk provides here. For example, you can define a tokenizer that picks out sequences of alphanumeric characters as tokens and drops everything else:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize('Eighty-seven miles to go, yet.  Onward!')

Output:

['Eighty', 'seven', 'miles', 'to', 'go', 'yet', 'Onward']

You do not really need NLTK to remove punctuation. You can remove it with simple python. For strings:

import string
s = '... some string with punctuation ...'
s = s.translate(None, string.punctuation)

Or for unicode:

import string
translate_table = dict((ord(char), None) for char in string.punctuation)   
s.translate(translate_table)

and then use this string in your tokenizer.

P.S. string module have some other sets of elements that can be removed (like digits).


Below code will remove all punctuation marks as well as non alphabetic characters. Copied from their book.

http://www.nltk.org/book/ch01.html

import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time. @ sd  4 232"

words = nltk.word_tokenize(s)

words=[word.lower() for word in words if word.isalpha()]

print(words)

output

['i', 'ca', 'do', 'this', 'now', 'because', 'i', 'so', 'tired', 'please', 'give', 'me', 'some', 'time', 'sd']

As noticed in comments start with sent_tokenize(), because word_tokenize() works only on a single sentence. You can filter out punctuation with filter(). And if you have an unicode strings make sure that is a unicode object (not a 'str' encoded with some encoding like 'utf-8').

from nltk.tokenize import word_tokenize, sent_tokenize

text = '''It is a blue, small, and extraordinary ball. Like no other'''
tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
print filter(lambda word: word not in ',-', tokens)

I just used the following code, which removed all the punctuation:

tokens = nltk.wordpunct_tokenize(raw)

type(tokens)

text = nltk.Text(tokens)

type(text)  

words = [w.lower() for w in text if w.isalpha()]

I think you need some sort of regular expression matching (the following code is in Python 3):

import string
import re
import nltk

s = "I can't do this now, because I'm so tired.  Please give me some time."
l = nltk.word_tokenize(s)
ll = [x for x in l if not re.fullmatch('[' + string.punctuation + ']+', x)]
print(l)
print(ll)

Output:

['I', 'ca', "n't", 'do', 'this', 'now', ',', 'because', 'I', "'m", 'so', 'tired', '.', 'Please', 'give', 'me', 'some', 'time', '.']
['I', 'ca', "n't", 'do', 'this', 'now', 'because', 'I', "'m", 'so', 'tired', 'Please', 'give', 'me', 'some', 'time']

Should work well in most cases since it removes punctuation while preserving tokens like "n't", which can't be obtained from regex tokenizers such as wordpunct_tokenize.


I use this code to remove punctuation:

import nltk
def getTerms(sentences):
    tokens = nltk.word_tokenize(sentences)
    words = [w.lower() for w in tokens if w.isalnum()]
    print tokens
    print words

getTerms("hh, hh3h. wo shi 2 4 A . fdffdf. A&&B ")

And If you want to check whether a token is a valid English word or not, you may need PyEnchant

Tutorial:

 import enchant
 d = enchant.Dict("en_US")
 d.check("Hello")
 d.check("Helo")
 d.suggest("Helo")

Remove punctuaion(It will remove . as well as part of punctuation handling using below code)

        tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
        text_string = text_string.translate(tbl) #text_string don't have punctuation
        w = word_tokenize(text_string)  #now tokenize the string 

Sample Input/Output:

direct flat in oberoi esquire. 3 bhk 2195 saleable 1330 carpet. rate of 14500 final plus 1% floor rise. tax approx 9% only. flat cost with parking 3.89 cr plus taxes plus possession charger. middle floor. north door. arey and oberoi woods facing. 53% paymemt due. 1% transfer charge with buyer. total cost around 4.20 cr approx plus possession charges. rahul soni

['direct', 'flat', 'oberoi', 'esquire', '3', 'bhk', '2195', 'saleable', '1330', 'carpet', 'rate', '14500', 'final', 'plus', '1', 'floor', 'rise', 'tax', 'approx', '9', 'flat', 'cost', 'parking', '389', 'cr', 'plus', 'taxes', 'plus', 'possession', 'charger', 'middle', 'floor', 'north', 'door', 'arey', 'oberoi', 'woods', 'facing', '53', 'paymemt', 'due', '1', 'transfer', 'charge', 'buyer', 'total', 'cost', 'around', '420', 'cr', 'approx', 'plus', 'possession', 'charges', 'rahul', 'soni']


Need Your Help

how to export multi table in mysql?

mysql export mysqldump

i have 10+ tables, i want to export them to another datebase. how could i do that?

SSDT-BI SSIS in x64?

sql-server visual-studio-2012 ssis sql-server-data-tools ssdt-bi

I run Visual Studio 2012 and I'm new to SSIS. I attempted to install SSDT-BI (June 2013 version) in order to research SSIS, but I was unable to complete the install with my default instance because...