CIS 122 Assignment 4

You're studying English lit, and your professor wants you to discuss how Jane Austen's word choice evolved over time. Counting the frequencies by hand is a daunting task, so write a computer program to do it for you!

Due Date Change
This assignment looks like it is taking longer than expected. Let's push the due date back to Monday, July 7, so everyone has more time to work on it if needed.

Instructions

In this assignment you will write a program that reads in an entire novel (in plain text format) and counts how often each word is used. Once you have counted all the words, print out a summary that lists the total number of words in the novel, the number of unique words (i.e., count the word 'of' only once, not every time it shows up), and the top 10 most frequently used words along with how many times each was used.
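
To make the two totals concrete, here is a minimal sketch of the counting step. It assumes Python 3, and `count_words` is a made-up name for illustration, not something from the template. The word list is a tiny hand-made sample rather than a real novel.

```python
def count_words(words):
    """Return a dictionary mapping each word to how many times it appears."""
    counts = {}
    for word in words:
        # get() returns 0 the first time we see a word
        counts[word] = counts.get(word, 0) + 1
    return counts


# Tiny hand-made sample, standing in for the words of a whole book
words = "it is a truth that it is".split()
counts = count_words(words)

total_words = sum(counts.values())  # every occurrence counts
unique_words = len(counts)          # each distinct word counts once

print("words", total_words)
print("unique words", unique_words)
```

The total is the sum of all the counts, while the number of unique words is just the number of keys in the dictionary.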

This assignment requires you to read in a file, split each line up into individual words, and use a dictionary to count the number of times each word was seen. Here are some things to keep in mind:

  • Each book contains a header and footer that you don't want to include in the word count. You need to ignore this part of the text. The header section ends with a line that looks like
                            
    *** START OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***
                        
    You want to read lines until you see one that starts like that, then start counting word frequencies. A similar line marks the end of the text:
                            
    *** END OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***
                        
  • We're not going to worry about words that are hyphenated at the end of a line. I don't think there are any in these books, but if there are, we'll treat the first part of the word and the second part of the word separately.
  • Make sure you get rid of all punctuation before splitting a line into words; otherwise, the punctuation will be part of the words.
  • To find the top words, we'll turn your dictionary of frequencies into a list and sort it. This code is kind of funky, so I provide a function that does it for you in the example.
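
The points above could be sketched roughly like this, assuming Python 3. All of the names (`body_lines`, `line_to_words`, `top_words`) are hypothetical and are not the ones used in the template; in particular, `top_words` is only a stand-in for the sorting function provided with the assignment. Two choices here are assumptions: words are lowercased (the sample output lists 'i'), and punctuation is replaced with spaces rather than deleted, so your counts may not match the sample runs exactly.

```python
import string

# Only the beginning of each marker line is checked, so the book title
# that follows it does not matter.
START_MARKER = "*** START OF THE PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THE PROJECT GUTENBERG EBOOK"


def body_lines(lines):
    """Yield only the lines between the start and end markers."""
    in_body = False
    for line in lines:
        if line.startswith(END_MARKER):
            break
        if in_body:
            yield line
        elif line.startswith(START_MARKER):
            in_body = True


def line_to_words(line):
    """Lowercase a line, turn ASCII punctuation into spaces, and split it."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return line.lower().translate(table).split()


def top_words(counts, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    return pairs[:n]


sample = [
    "some header text\n",
    "*** START OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***\n",
    "It is a truth,\n",
    "*** END OF THE PROJECT GUTENBERG EBOOK, PRIDE AND PREJUDICE ***\n",
    "some footer text\n",
]
for line in body_lines(sample):
    print(line_to_words(line))
```

Note that `string.punctuation` only covers ASCII punctuation, so any other symbols in a book would need to be handled separately.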

There is a template available that you may use as a starting point. You are not required to use this template, but it might give you some ideas and help you get going.

Here are two of Jane Austen's books (downloaded from Project Gutenberg) to test your program with.

Input

This program needs to ask the user for the following information:

  • What book to read in

The full text of the book (in the format used by Project Gutenberg) also needs to be present at the location specified by the user.

Important: you don't need to validate the user's input. If they mistype the name of the file, it's okay if your program just dies.
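
A minimal sketch of the input step might look like this, assuming Python 3. The names `read_book` and `main` are hypothetical. There is deliberately no error handling: if the file doesn't exist, `open` raises an error and the program stops, which is fine for this assignment.

```python
def read_book(filename):
    """Return every line of the file as a list of strings."""
    with open(filename) as book:
        return book.readlines()


def main():
    # The prompt text matches the sample runs
    filename = input("What file do you want to process? ")
    lines = read_book(filename)  # dies here if the name was mistyped
    print("read", len(lines), "lines from", filename)
```

Call `main()` at the bottom of your file to run it.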

Output

The output of your program should look like the examples below. If you want to have fun and make it fancier, you are welcome to.

Here's what one sample run of the program looks like:

What file do you want to process? MansfieldPark.txt
  processing MansfieldPark.txt
  done processing MansfieldPark.txt

The book MansfieldPark.txt contains:
                         words 160617
                  unique words 8552
The top 10 words and their frequencies are:
                           the 6195
                            to 5421
                           and 5391
                            of 4777
                             a 3079
                           her 3074
                           was 2648
                            in 2493
                             i 2346
                            it 2258
                

And here's another (longer) sample run:

What file do you want to process? PrideAndPrejudice.txt
  processing PrideAndPrejudice.txt
  done processing PrideAndPrejudice.txt

The book PrideAndPrejudice.txt contains:
                         words 122275
                  unique words 6877
The top 10 words and their frequencies are:
                           the 4321
                            to 4126
                            of 3595
                           and 3529
                           her 2195
                             i 2047
                             a 1941
                            in 1861
                           was 1840
                           she 1688
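
One way to get the right-aligned layout shown in the sample runs is a format specifier with a fixed width. This is only a sketch, assuming Python 3.6 or later for f-strings; the width of 30 is read off the sample output rather than specified anywhere, and `format_row` and `format_summary` are hypothetical names.

```python
def format_row(label, count, width=30):
    """Right-align the label in a fixed-width column, then add the count."""
    return f"{label:>{width}} {count}"


def format_summary(filename, total, unique, top):
    """Build the summary text from the totals and a list of (word, count) pairs."""
    lines = [
        f"The book {filename} contains:",
        format_row("words", total),
        format_row("unique words", unique),
        f"The top {len(top)} words and their frequencies are:",
    ]
    for word, count in top:
        lines.append(format_row(word, count))
    return "\n".join(lines)


# Numbers copied from the first sample run (only the first two top words)
print(format_summary("MansfieldPark.txt", 160617, 8552,
                     [("the", 6195), ("to", 5421)]))
```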
                

Extra credit

(A) Words like 'to', 'the', 'and', 'or', and similar short words are generally not as interesting. They are called 'stop words' and are usually ignored in this kind of processing. Find a list of stop words on the internet and modify your program to ignore those words.
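
As a rough idea of how the filtering could work (Python 3 assumed), the sketch below drops stop words from a dictionary of counts. The `STOP_WORDS` set is a tiny made-up sample, not a real stop word list, and `remove_stop_words` is a hypothetical name. You could just as well skip stop words while counting instead of removing them afterwards.

```python
# Tiny illustrative sample; a real list found online would be much longer
STOP_WORDS = {"to", "the", "and", "or", "of", "a"}


def remove_stop_words(counts, stop_words=STOP_WORDS):
    """Return a new dictionary without any of the stop words."""
    return {word: n for word, n in counts.items() if word not in stop_words}


counts = {"the": 5, "elizabeth": 3, "to": 4, "darcy": 2}
print(remove_stop_words(counts))
```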

(B) We read our texts from a file, but wouldn't it be cool to read them straight from the website? Read the Python docs and figure out how to load the file straight from the Internet. There are many ways to do this; I'd suggest starting with the urllib library (section 18.5 of the Python library documentation).
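
Here is one possible starting point, assuming Python 3, where `urlopen` lives in `urllib.request` (in older versions of Python it is found elsewhere). The function name `read_book_from_url` is hypothetical, and the UTF-8 encoding is an assumption; the actual files may use a different encoding.

```python
from urllib.request import urlopen


def read_book_from_url(url):
    """Download the text at url and return it as a list of lines."""
    with urlopen(url) as response:
        # Assumption: the text is UTF-8 encoded
        text = response.read().decode("utf-8")
    return text.splitlines()
```

The list it returns can be fed to the rest of your program in place of the lines read from a file, except that `splitlines` drops the newline at the end of each line.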

Make sure you specify in the notes field if you did one of the extra credit options.

Deliverables

Submit your Python program via the submit form at the bottom of this page.

This assignment was originally due Thursday, July 3rd at 5:00pm. The due date has been pushed back to Monday, July 7th!

Submit

Use this form to submit your completed assignment. If you submit it once and then find a mistake and want to submit it again, you can. You can submit as many versions as you'd like and I'll generally only look at the last one.

What is your student #?
What is your name?
What is your email?
Notes about this assignment:
Use the notes field to mention if you worked in a group or got help with any part of the assignment. Also note if something doesn't work correctly; that shows that you know about the problem. If you don't mention it, I'll assume you didn't test your code and will grade you more harshly.
What file would you like to submit?
This assignment only requires you to turn in one file. If you think you need to turn in more than one, let me know about it and we'll figure out why.

The only acceptable file type to turn in for this assignment is the source code to your program (your .py file).

