

    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

    5 25 26 austen-emma.txt
    5 26 17 austen-persuasion.txt
    5 28 22 austen-sense.txt
    4 34 79 bible-kjv.txt
    5 19 5 blake-poems.txt
    4 19 14 bryant-stories.txt
    4 18 12 burgess-busterbrown.txt
    4 20 13 carroll-alice.txt
    5 20 12 chesterton-ball.txt
    5 23 11 chesterton-brown.txt
    5 18 11 chesterton-thursday.txt
    4 21 25 edgeworth-parents.txt
    5 26 15 melville-moby_dick.txt
    5 52 11 milton-paradise.txt
    4 12 9 shakespeare-caesar.txt
    4 12 8 shakespeare-hamlet.txt
    4 12 7 shakespeare-macbeth.txt
    5 36 12 whitman-leaves.txt

This program displays three statistics for each text: average word length, average sentence length, and the number of times each vocabulary item appears in the text on average (our lexical diversity score). Observe that average word length appears to be a general property of English, since it has a recurrent value of 4. (In fact, the average word length is really 3 not 4, since the num_chars variable counts space characters.) By contrast, average sentence length and lexical diversity appear to be characteristics of particular authors.

The previous example also showed how we can access the "raw" text of the book, not split up into tokens. The raw() function gives us the contents of the file without any linguistic processing. So, for example, len(gutenberg.raw('blake-poems.txt')) tells us how many letters occur in the text, including the spaces between words. The sents() function divides the text up into its sentences, where each sentence is a list of words.

    firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to se.
    grail.txt SCENE 1: KING ARTHUR: Whoa there! [clop.
    overheard.txt White guy: So, do you have any plans for this evening? Asian girl.
    pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terr.
    singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encoun.
    wine.txt Lovely delicate, fragrant Rhone wine.

There is also a corpus of instant messaging chat sessions, originally collected by the Naval Postgraduate School for research on automatic detection of Internet predators.
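The three statistics (average word length, average sentence length, and lexical diversity) can be tried out on a short string without downloading any corpus. What follows is a minimal sketch rather than NLTK's own method: `text_stats` is a hypothetical helper, and its regular-expression word and sentence splitting are simplifying assumptions standing in for `gutenberg.words()` and `gutenberg.sents()`. The variable names mirror those used in the program.

```python
import re

def text_stats(raw):
    """Return (avg word length, avg sentence length, avg uses per vocabulary item)."""
    # Assumption: naive tokenization, not NLTK's tokenizer.
    # A token is a run of word characters or a single punctuation mark.
    words = re.findall(r"\w+|[^\w\s]", raw)
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sents = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]

    num_chars = len(raw)        # includes spaces, as raw() does in the text
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len(set(w.lower() for w in words))  # case-folded vocabulary

    return (round(num_chars / num_words),
            round(num_words / num_sents),
            round(num_words / num_vocab))

sample = "The cat sat. The cat ran. The dog sat!"
print(text_stats(sample))  # (3, 4, 2)
```

The sample has 38 characters, 12 tokens, 3 sentences and 7 distinct lowercased tokens, which gives the ratios 38/12, 12/3 and 12/7 before rounding. Because `num_chars` counts the spaces, the first figure overstates the true average word length, which is the point made in the parenthetical remark about `num_chars`.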
