Playing with NLTK
I was playing with the Python NLTK again for the first time in a while and set myself the task of finding all the different ways “to go” is used in English. So I was aiming for a list including stuff like “to go up”, “to go down”, etc.
Apologies if the Python code is horrible; I’m still getting used to how to do things in it. I’m still thinking in a mix of Perl and Ruby and reading the manual a lot.
Code:
import re

import nltk
from nltk.corpus import brown

porter = nltk.PorterStemmer()
# Matches tags that start with a word character, which skips punctuation tags
p = re.compile(r'\w+')
words = []

def process(sentence):
    # Walk over adjacent (word, tag) pairs in one tagged sentence
    for (w1, t1), (w2, t2) in nltk.bigrams(sentence):
        stem1 = porter.stem(w1)
        if stem1 == 'go' and t1.startswith('V') and p.match(t2):
            words.append('to ' + stem1 + ' ' + w2 + ' (' + t1 + ', ' + t2 + ')')

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)

fdist = nltk.FreqDist(words)
# most_common() (NLTK 3) sorts by frequency; the with block closes the file
with open('output.txt', 'w') as f:
    for key, val in fdist.most_common():
        f.write("%d : %s\n" % (val, key))
The results are:
202 : to go to (VBG, TO)
94 : to go to (VB, IN)
34 : to go to (VBG, IN)
29 : to go on (VB, RP)
27 : to go on (VBG, RP)
25 : to go out (VB, RP)
24 : to go back (VB, RB)
18 : to go home (VB, NR)
18 : to go into (VB, IN)
18 : to go up (VB, RP)
18 : to go with (VB, IN)
10 : to go down (VB, RP)
9 : to go back (VBG, RB)
8 : to go into (VBG, IN)
8 : to go through (VB, IN)
7 : to go a (VB, AT)
7 : to go and (VB, CC)
7 : to go for (VB, IN)
7 : to go out (VBG, RP)
7 : to go over (VB, RP)
7 : to go to (VB, TO)
...
I could do more to trim down the results, like merging all the different types of verb part-of-speech tags (VB, VBG).
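Here is a rough sketch of what that merging might look like. It is not part of the script above: the helper names merge_verb_tag and merge_counts are made up for illustration, and it assumes that treating every tag starting with "VB" as plain VB is good enough. The sample counts are a handful copied from the results, so it runs without the corpus.

```python
from collections import Counter

def merge_verb_tag(tag):
    """Collapse verb tags such as VB, VBG, VBD into a single 'VB'."""
    # Assumption: anything starting with 'VB' counts as a verb tag;
    # all other tags pass through unchanged.
    return 'VB' if tag.startswith('VB') else tag

def merge_counts(counts):
    """Re-key (next word, verb tag, next tag) counts using merged verb tags."""
    merged = Counter()
    for (w2, t1, t2), n in counts.items():
        merged[(w2, merge_verb_tag(t1), t2)] += n
    return merged

# A few counts copied from the results (not the full list)
sample = {
    ('to', 'VBG', 'TO'): 202,
    ('to', 'VB', 'IN'): 94,
    ('to', 'VBG', 'IN'): 34,
    ('on', 'VB', 'RP'): 29,
    ('on', 'VBG', 'RP'): 27,
}

# Print merged counts, most frequent first, in the same format as the results
for (w2, t1, t2), n in merge_counts(sample).most_common():
    print("%d : to go %s (%s, %s)" % (n, w2, t1, t2))
```

With this, the separate VB and VBG rows for the same following word and tag get added together, so for example “to go on” ends up as a single row.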
If you have any comments, I’d be happy to hear them.
