20. Frequency Analysis
Find the most frequent words in a text document.
In this assignment, we will find the most frequent words in Jane Austen’s Sense and Sensibility.
Your goal is to print the 100 most frequent words along with their frequency counts. Your method should be fast and flexible so that it can handle longer documents and print more or fewer than 100 words.
- Read the information in the file using File and Scanner.
- Clean up the information by removing non-word characters. Make everything lowercase for consistency. This step is surprisingly tricky to get perfect. I recommend replaceAll using the regular expression "\\W+" to match at least one non-word character. See a handy cheat sheet (pdf) for more information.
- Create a Map to count how many times each word appears.
- Create your own custom WordCount class that will hold one word and its frequency together.
  - Write a constructor.
  - Write a toString method.
  - Write a compareTo method and implement the Comparable interface. This should be written so that the highest frequency comes first.
- Make a TreeSet with the WordCount class using the data from the file.
- Print the top 100 most frequent words in your TreeSet.
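The steps above can be sketched roughly as follows. This is one possible shape, not the required solution. The class name FrequencyAnalysis, the helper methods countWords, rank, and printTop, and the file name sense_and_sensibility.txt are all assumptions made for illustration. One detail worth noticing: a TreeSet treats two elements as duplicates when compareTo returns 0, so this sketch breaks ties alphabetically to avoid silently dropping words that share the same count.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.TreeSet;

class FrequencyAnalysis {

    // Holds one word and its frequency together.
    static class WordCount implements Comparable<WordCount> {
        final String word;
        final int count;

        WordCount(String word, int count) {
            this.word = word;
            this.count = count;
        }

        @Override
        public String toString() {
            return word + ": " + count;
        }

        // Highest count first. Ties fall back to alphabetical order so that
        // TreeSet does not treat two words with equal counts as duplicates.
        @Override
        public int compareTo(WordCount other) {
            int byCount = Integer.compare(other.count, this.count);
            return byCount != 0 ? byCount : this.word.compareTo(other.word);
        }
    }

    // Reads whitespace-separated tokens, strips non-word characters,
    // lowercases the result, and tallies each word in a Map.
    static Map<String, Integer> countWords(Scanner in) {
        Map<String, Integer> counts = new HashMap<>();
        while (in.hasNext()) {
            String word = in.next().replaceAll("\\W+", "").toLowerCase();
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Copies the Map entries into a TreeSet, which keeps them sorted
    // according to WordCount.compareTo.
    static TreeSet<WordCount> rank(Map<String, Integer> counts) {
        TreeSet<WordCount> ranked = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            ranked.add(new WordCount(entry.getKey(), entry.getValue()));
        }
        return ranked;
    }

    // Prints at most n entries, starting with the most frequent.
    static void printTop(TreeSet<WordCount> ranked, int n) {
        int printed = 0;
        for (WordCount wc : ranked) {
            if (printed++ >= n) {
                break;
            }
            System.out.println(wc);
        }
    }

    public static void main(String[] args) throws FileNotFoundException {
        // Hypothetical file name; point this at your own copy of the text.
        try (Scanner in = new Scanner(new File("sense_and_sensibility.txt"))) {
            printTop(rank(countWords(in)), 100);
        }
    }
}
```

Passing the limit to printTop as a parameter is what lets you ask for more or fewer than 100 words. Note that this cleanup is deliberately simple: stripping non-word characters from each token turns "don't" into "dont" and joins hyphenated words, and underscores survive because the regular expression counts them as word characters. Handling those cases well is the tricky part mentioned above.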
Last modified August 25, 2021: ap-cs 2017-2018 (3bb9976)