CS 122 - Assignment 5

Solution

Here are my solutions for this assignment. Note that as with all assignments, my versions are not the only way to do it!

Solution to the basic portion: solution.py

Solution to the Mastery portion: solutionMaster.py

Posted Tuesday, August 11

Instructions

In this assignment you will write a program that reads in an entire novel (in plain text format) and counts the frequency with which each word is used. Once you have counted all the words, print out a summary that lists the total number of words in the novel, the number of unique words (i.e. count the word 'of' only once, not every time it shows up), and the top 10 most frequently used words along with how many times they were used.

Here's the output from a sample run of the program reading the book Pride and Prejudice:

    What file do you want to process? PrideAndPrejudice.txt
    Processing PrideAndPrejudice.txt
    Done processing PrideAndPrejudice.txt
    The book PrideAndPrejudice.txt contains:
    total words 123882
    unique words 7470
    The top 10 words and their frequencies are:
    the 4411
    to 4195
    of 3654
    and 3578
    her 2208
    i 2050
    a 1972
    in 1886
    was 1843
    she 1701

For this assignment I am providing a template that you may use as a starting point. You are not required to use it, but it might give you some ideas and help you get going in the right direction.

Your program will start by prompting the user for the name of a file to read. Open the file and read it in. For each line in the file, split it up into words by using the split() method. Then loop through all the words in the line and remove the punctuation from each one by using the strip() method. Note that strip() takes an optional argument, so you can have it strip off characters other than whitespace.
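Here's a minimal sketch of the split-then-strip step. The function name clean_words is made up for illustration, and using string.punctuation as the set of characters to strip is an assumption on my part, not a requirement of the assignment.

```python
import string

def clean_words(line):
    """Split a line on whitespace and strip punctuation from both ends of each word."""
    words = []
    for raw in line.split():
        # strip() with an argument removes any of those characters from both ends
        word = raw.strip(string.punctuation)
        if word:  # skip tokens that were nothing but punctuation
            words.append(word)
    return words

print(clean_words('"Well, well," said Mr. Bennet -- "what now?"'))
```

Keep in mind that strip() only removes characters from the ends of a string, so an apostrophe in the middle of a word like "don't" is left alone.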

We'll use a dictionary to keep track of the number of times we've seen each word. For each word in the line of text we're processing, check whether it is already in the dictionary. If it is, get the number of times it was seen previously and add 1 to it. Otherwise, store 1 in the dictionary, because this is the first time we've seen the word.
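The counting logic just described might look something like this. It's only a sketch: the function name count_words and the sample word list are made up for illustration.

```python
def count_words(words, counts=None):
    """Add each word in `words` to the `counts` dictionary and return it."""
    if counts is None:
        counts = {}
    for word in words:
        if word in counts:
            # seen before: add 1 to the previous count
            counts[word] = counts[word] + 1
        else:
            # first time we've seen this word
            counts[word] = 1
    return counts

counts = count_words(["the", "cat", "saw", "the", "dog"])
print(counts["the"])          # times "the" was seen
print(len(counts))            # number of unique words
print(sum(counts.values()))   # total number of words
```

Once the dictionary is built, the number of unique words is just its length, and the total number of words is the sum of its values.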

After we've read through the entire book, we have a big dictionary that lists the number of times each word in the book was used. We want to sort this and print out the top 10 words. Unfortunately, sorting a dictionary is a little bit tricky, so I provide code that will give you a list of the top 10 words. You can use this code to print the words and the number of times they were seen.
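If you're curious how the sorting could be done, here is one possible approach. This is not the code provided with the assignment, just a sketch using Python's built-in sorted() function; the name top_words and the sample counts are made up.

```python
def top_words(counts, n=10):
    """Return the n (word, count) pairs with the highest counts, largest first."""
    # counts.items() gives (word, count) pairs; sort them by the count
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]

sample = {"the": 5, "to": 3, "of": 4, "and": 1}
for word, count in top_words(sample, 3):
    print(word, count)
```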

A couple of things to keep in mind as you work on this assignment:

We're not going to worry about words that are hyphenated at the end of a line. I don't think there are any in these books, but if there are, we'll treat the first part of the word and the second part of the word separately.
Make sure you get rid of all punctuation after splitting a line into words, otherwise the punctuation will be part of the words. Consider the sentence "Can you use the can opener to open that can?". After splitting it up on whitespace you will get a list of words that looks like this: ['"Can', 'you', 'use', 'the', 'can', 'opener', 'to', 'open', 'that', 'can?"'] The word "can" appears three times in the list, but each one looks different: '"Can', 'can', and 'can?"'. We want to normalize them so that Python realizes they are all the same word. We do this by stripping off punctuation (like quotes and question marks) and converting every word to lower case.
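To see the normalization at work on the example sentence, here is a short sketch. The helper name normalize is made up, and stripping everything in string.punctuation is an assumption; you could pass strip() a shorter list of characters instead.

```python
import string

def normalize(word):
    """Strip punctuation from both ends of a word and convert it to lower case."""
    return word.strip(string.punctuation).lower()

sentence = '"Can you use the can opener to open that can?"'
tokens = sentence.split()
normalized = [normalize(token) for token in tokens]

print(tokens)
print(normalized)
print(normalized.count("can"))
```

After normalizing, all three variations compare equal, so they end up under the same key in the dictionary.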

Here are two of Jane Austen's books (downloaded from Project Gutenberg) to test your program with:

Pride and Prejudice or get it from its original URL
Mansfield Park or get it from its original URL

Here's another sample run of the program (this time for Mansfield Park):

    What file do you want to process? MansfieldPark.txt
    Processing MansfieldPark.txt
    Done processing MansfieldPark.txt
    The book MansfieldPark.txt contains:
    total words 162547
    unique words 9010
    The top 10 words and their frequencies are:
    the 6371
    to 5502
    and 5458
    of 4890
    a 3138
    her 3109
    was 2657
    in 2551
    i 2357
    she 2261

Mastery

Part 1: As you can see in the above results, the most common words are simple short words like "the", "and", "or", and so on. We might want to exclude words like these (called stop words) from our results. I've downloaded a file of 469 stop words. You can download it here. Modify your program so that before it starts, it reads in this list of stop words and then ignores them as it counts word frequencies.

I suggest creating a list of stop words as you read the file. Then as you process the words in the book, you can examine each word and check to see if it is in the list of stop words.

    if word not in stopListWordList:
        # count this word because it is not a stop word
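Here's a rough sketch of how reading and using the stop words might go. I'm assuming the stop word file has one word per line; the function name load_stop_words and the file name in the comment are made up for illustration.

```python
def load_stop_words(lines):
    """Build a list of stop words from an iterable of lines (for example, an open file)."""
    stop_words = []
    for line in lines:
        word = line.strip().lower()
        if word:  # ignore blank lines
            stop_words.append(word)
    return stop_words

# In the real program the lines would come from a file, e.g. open("stopwords.txt")
# (hypothetical file name). A small hand-made list stands in for it here.
stop_words = load_stop_words(["the\n", "and\n", "of\n", "\n"])

words = ["the", "pride", "of", "elizabeth", "and", "darcy"]
kept = [word for word in words if word not in stop_words]
print(kept)
```

A list works fine for a few hundred stop words. If you want the check to be faster, you can convert the list to a set with set(stop_words); the "not in" test is written the same way.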

Part 2: Each book contains a header and footer that you don't want to include in the word count. You need to ignore this part of the text. The header section ends with a line that looks like

*** START OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***

You want to read lines until you see one that starts like that, then start counting word frequencies. There is a similar marker that marks the end of the text.

*** END OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***

Modify your code so that it skips the header and footer when counting word frequencies. I suggest reading the file in 2 sections: first have a while loop that reads until it hits the line that marks the end of the header, then move on to a second while loop that reads the book and counts word frequencies until it hits the start of the footer.
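The two-loop idea could be sketched like this. The function name read_book_body is made up, and I'm assuming the markers can be recognized by the prefixes "*** START OF" and "*** END OF" as in the examples above. The fake book is just a stand-in for a real file object.

```python
import io

def read_book_body(book_file):
    """Return the lines between the START and END markers of a Project Gutenberg text."""
    # First loop: skip the header, stopping at the START marker line (or end of file).
    line = book_file.readline()
    while line != "" and not line.startswith("*** START OF"):
        line = book_file.readline()

    # Second loop: collect lines until the END marker line (or end of file).
    body = []
    line = book_file.readline()
    while line != "" and not line.startswith("*** END OF"):
        body.append(line)
        line = book_file.readline()
    return body

fake_book = io.StringIO(
    "License text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***\n"
    "It is a truth universally acknowledged\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***\n"
    "More license text\n"
)
body = read_book_body(fake_book)
print(body)
```

In the real program you would count word frequencies inside the second loop instead of collecting the lines. Note that readline() returns an empty string at the end of the file, which is why both loops also check for "".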

These two modifications will of course give you different results. I got around 52000 total non-stop words, but I'll leave the rest of the results as a surprise.

Challenge

We read our texts from a file, but wouldn't it be cool to read them straight from the website? Read the Python docs and figure out how to load the file straight from the Internet. There are many ways to do this. I'd suggest starting with the urllib2 library (section 21.6 of the Python library documentation).
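As a starting point, here is a small sketch of reading a text from a URL. Big caveat: it is written for Python 3, where the functionality of urllib2 lives in urllib.request; under Python 2 you would call urllib2.urlopen instead. The function name read_text_from_url is made up, and treating the text as UTF-8 is an assumption.

```python
import urllib.request

def read_text_from_url(url):
    """Download the document at `url` and return it as a list of lines."""
    with urllib.request.urlopen(url) as response:
        data = response.read()  # raw bytes from the server
    # Assumption: the text is UTF-8 encoded; replace undecodable bytes rather than crash.
    text = data.decode("utf-8", errors="replace")
    return text.splitlines()
```

You would call it with the address of the book's plain text file and then process the returned lines the same way you process lines read from a local file.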

Deliverables

Submit your python program

This assignment is due Friday, August 7th, at 7:00 am

Submit

Use this form to submit your completed assignment. You can submit an assignment multiple times, and I will generally only look at the last submission. If you do submit more than once, make sure that each submission is complete on its own and doesn't rely on something you submitted previously.

Student number (no dashes): This is the 9-digit number on your student body card. It probably starts 950 or 951 and it looks something like this: 950123456. This number is used to identify your submission so please enter it carefully.
Name (first and last):
Email address:
Notes about this assignment: Use the notes field to mention if you worked in a group or got help with any part of the assignment. Also note if something doesn't work correctly, since that shows that you know about the problem. If you don't mention it, I'll assume you didn't test your project, and thus will grade you more harshly.
Your solution to this assignment