NYU FRE 7871: HW 1

(Derived from: Machine Learning, 10-601, CMU)

Naive Bayes

For this exercise you will be implementing a multinomial Naive Bayes classifier to classify various text articles downloaded from the Internet.

  1. Review your notes on Naive Bayes classifiers.

  2. Download the HW1 data archive from the class webpage. The archive contains three folders. The Economist articles are further divided into seven subfolders, each containing articles concerning a particular geo-political region: africa, asia, britain, europe, latin_america, north_america, and international.

  3. Convert each article from a long string of text into a vector of word counts. Specifically, for each article, count the number of times each word occurs in that article. For instance, given the string S = "Will tomorrow be as sunny as today?" and a vocabulary V = {0:will, 1:tomorrow, 2:be, 3:as, 4:sunny, 5:today, 6:pajamas}, each word in our vocabulary occurs in S once, except 3:as, which occurs twice, and 6:pajamas, which occurs zero times. Thus, using our vocabulary V of length w, which assigns an integer index to each word, we can represent the original string S as a length-w vector of integer counts: {1, 1, 1, 2, 1, 1, 0}. (Code has been provided in the archive to help you parse the articles and write out their word-count vectors to files.) Notice that as the size of your vocabulary grows, the length of the feature vector also increases, independent of the length of the articles; that is, there will be more and more zero entries in the word-count vector. To avoid memory issues, you are advised to use a sparse representation of these word-count vectors, perhaps by omitting or compressing the zero-count entries (see the word-counting sketch at the end of this handout). Also, calculating these counts over all the articles may take a long time, so I recommend you do your testing and development on a smaller subset of the data, and only run the full experiment once you have worked out the bugs.

  4. Now, having successfully represented each article as a single integer vector of word counts, you will perform two classification experiments:

    1. Onion vs. Economist: Assign the class label econ to all the Economist articles (regardless of region/subfolder), and assign the class label onion to all the Onion articles. Your task is to implement a Multinomial Naive Bayes classifier to distinguish between these two classes of articles.

    2. Economist regions: Now, for each Economist article, assign as a class label the name of the folder it is in. Thus, all articles in the asia/ folder would be labeled as class asia, those in africa/ as class africa, etc. Given this set of labellings, your task is to implement a Multinomial Naive Bayes classifier to distinguish between these seven classes of articles.

    For each experiment above:

    1. Evaluate your learned classifier using 10-fold cross-validation and report your overall prediction accuracy, along with the precision, recall, and F1 of each class. Also summarize your errors using a confusion matrix (an illustrative cross-validation sketch appears at the end of this handout).

    2. Analyze your results: Are you surprised (positively or negatively) by the results? What did you expect and why? Examine any systematic errors your classifier makes. Can you come up with a plausible explanation?

    3. Analyze your model: Train on the full set of data and look at the weights assigned to the words in your model. For each class, list the five words the model considers most likely to be observed given that class. Do you think these words are the most representative of the class? If not, list five words you feel are more representative, and explain your choice. Why might these lists of words differ? Are some types of words more useful than others for distinguishing between classes? Repeat the analysis for the least probable words given each class.

    4. Submit your code and any processed data or models, along with your results, via e-mail to the instructor.
    Notes: To avoid computational underflow, you may want to do your likelihood calculations in log space: since ln(p^n) = n·ln(p), we have p^n = e^(n·ln(p)); recall also that log(Π_i p_i) = Σ_i log(p_i). You may also find it helpful to smooth your zero counts. Illustrative sketches of the word-counting, classification, and cross-validation steps follow.
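
Sketch 1 (word counting). One way to build the sparse word-count vectors described in step 3 is to store only the nonzero counts, e.g. as a mapping from word to count. This is not part of the provided starter code; the tokenizer and function names below are assumptions, not requirements.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_words(path):
    """Return a sparse word-count vector (word -> count) for one article file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return Counter(tokenize(f.read()))

# The handout's example: every word occurs once except "as" (twice); words with
# zero counts (e.g. "pajamas") are simply absent from the sparse representation.
print(Counter(tokenize("Will tomorrow be as sunny as today?")))
# Counter({'as': 2, 'will': 1, 'tomorrow': 1, 'be': 1, 'sunny': 1, 'today': 1})
```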
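
Sketch 2 (classification). A minimal multinomial Naive Bayes learner with add-one (Laplace) smoothing and log-space scoring, as suggested in the Notes above. The data layout assumed here, a list of (label, word-count) pairs plus a vocabulary set, is an assumption for the sketch, not the format produced by the provided parsing code.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_counts, vocab):
    """labeled_counts: iterable of (label, Counter) pairs.
    Returns log P(class) and, per class, log P(word | class) with add-one smoothing."""
    class_docs = Counter()                      # number of documents per class
    class_words = defaultdict(Counter)          # summed word counts per class
    for label, counts in labeled_counts:
        class_docs[label] += 1
        class_words[label].update(counts)

    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    log_likelihood = {}
    for c, wc in class_words.items():
        total = sum(wc.values()) + len(vocab)   # smoothed denominator
        log_likelihood[c] = {w: math.log((wc[w] + 1) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(counts, log_prior, log_likelihood):
    """Score each class as log P(c) + sum over words of count * log P(word | c),
    working in log space to avoid underflow; out-of-vocabulary words are skipped."""
    scores = {}
    for c in log_prior:
        ll = log_likelihood[c]
        scores[c] = log_prior[c] + sum(n * ll[w] for w, n in counts.items() if w in ll)
    return max(scores, key=scores.get)
```

The per-class log-likelihood table built here is also what you would inspect for the model-analysis question: the five highest (or lowest) entries for a class are its most (or least) probable words under the model.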
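
Sketch 3 (evaluation). A minimal 10-fold cross-validation loop reporting overall accuracy, per-class precision, recall, and F1, and a confusion matrix. It reuses the hypothetical train_nb/predict_nb helpers sketched above and assumes labeled_counts is a list; the round-robin fold split is a simplification (you may prefer a shuffled split).

```python
from collections import Counter, defaultdict

def cross_validate(labeled_counts, vocab, k=10):
    folds = [labeled_counts[i::k] for i in range(k)]   # round-robin fold assignment
    confusion = defaultdict(Counter)                   # confusion[true][predicted]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        log_prior, log_likelihood = train_nb(train, vocab)
        for label, counts in folds[i]:
            confusion[label][predict_nb(counts, log_prior, log_likelihood)] += 1

    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in confusion)
    print(f"accuracy: {correct / total:.3f}")
    for c in confusion:
        tp = confusion[c][c]
        fp = sum(confusion[other][c] for other in confusion if other != c)
        fn = sum(confusion[c].values()) - tp
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        print(f"{c}: precision={prec:.3f}  recall={rec:.3f}  F1={f1:.3f}")
    return confusion
```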