Overview:
Course: News Analytics and Machine Learning (NYU FRE-GY 7871 - I2)
Term: Fall 2019, first half
Instructor: Andrew Arnold (aoa216@nyu.edu)
Disclaimer: All views and opinions expressed by the instructor in this course are his own and do not reflect the views, opinions, or confidential information of any of his current or former employers.
Office hours: by appointment
GA/Grader/Tutor: TBD
Location: Rogers Hall, Room 216 (Brooklyn Campus)
Time: Tuesdays, 6:00 PM - 8:41 PM
Course style: Given the small class size, the course will be taught as a colloquium. New topics will be introduced in interactive lecture format, and then discussed and expanded by the group. These topics will then be built upon in the team projects, which will be further discussed and presented to the class. Active class attendance and participation are required.
Grading:
 Attendance and participation*: 25%
 Homework*: 10%
 Midterm exam: 15%
 Course project: 50% total, split within the project grade as follows:
   Project proposal: 15%
   Midterm presentation: 35%
   Final presentation: 50%
* Note about late registration: Since the class only meets seven times and the first homework is assigned on the first day of class, it may be difficult to make up for missed homework and attendance if you miss even the first day of class. Please let me know if you are considering joining the class late so we can discuss the implications.
Collaboration policy: As in the real world, collaboration is encouraged, but plagiarism is not. Transparency is the difference. If you collaborate (with other members of this class, other classes, colleagues, friends, random people on the internet) that is fine, just state so. If the contributions of the authors to a particular work are uneven, just give a rough estimate of each author's contribution (e.g., A did most of the math, B did most of the programming, and C did the literature review). Feel free to use all publicly available resources on the internet, but please cite them if they are used as more than basic background research (both to give proper credit to the original author and to help your peers discover new resources). Since this is a special topics course, I tend to assume students are interested in learning the material and thus give the benefit of the doubt. If proper credit is not given, however, or if bad faith / dishonesty is shown, consequences can be severe, including failing the class and referral to the administration.
Abstract:
The fast-growing field of news analytics requires large databases, fast computation, and robust statistics. This course introduces the tools and techniques for analyzing news, including how to quantify textual items based on, for example, positive or negative sentiment, relevance to each stock, and the amount of novelty in the content. Applications to trading strategies are discussed, including both absolute and relative return strategies, and risk management strategies. Students will be exposed to leading software in this space.
Students will benefit from some familiarity with basic probability, statistics and programming (Python), and an interest in natural language processing (NLP) or computational linguistics. While the course will introduce a few trading strategies, it will also focus on NLP as a tool in its own right, applicable to domains outside of quantitative trading strategies.
There will be readings, discussion, homework, a midterm exam and a final project.
Course outcomes:
After this course you should be able to:
 Build a basic trading strategy based on natural language signals:
 Identify, locate and clean appropriate data sources.
 Formulate a trading hypothesis based on natural language signals.
 Investigate this hypothesis qualitatively and quantitatively, using statistical, programming, NLP and trading best practices.
 Present the results of your investigation to your peers for feedback and analysis.
 Read an academic paper / industry whitepaper about natural language techniques applied to trading and have a basic understanding of it.
 Have a sense of where the state of the art is currently and where it might head in the near future. Know the difference between science fiction and reality.
 Decide if you would like to pursue further research in this area.
Prerequisites:
 Foundations of Financial Technology (FRE-GY 6153) or equivalent:
 Basic knowledge of financial markets (What is a stock? How does it trade?)
 Basic statistics (What is variance?)
 Big Data in Finance (FRE-GY 7221) or equivalent:
 Basic programming ability (Parse a csv file and calculate the variance of the values. Python/R/Matlab)
 Test: Given enough time and access to the internet could you:
 Determine the 10 largest US stocks by market capitalization as of 12/31/2018
 Download the closing prices for these stocks for the last 5 Tuesdays of 2018
 Calculate the variance of each stock's closing price during that period
If so, you are qualified to take this course.
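The last step of the test above (parse a csv file and calculate a variance per stock) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a required solution: the column names (date, ticker, close), the tickers AAA and BBB, and all prices are made-up placeholders rather than real market data, and the choice of sample variance (dividing by n - 1) is one reasonable convention among several.

```python
import csv
import io
import statistics

# Placeholder data for illustration only: the tickers and prices are made up.
# In practice this would be a csv file of closing prices that you downloaded.
SAMPLE_CSV = """date,ticker,close
2018-12-04,AAA,10.0
2018-12-11,AAA,12.0
2018-12-18,AAA,14.0
2018-12-04,BBB,5.0
2018-12-11,BBB,5.0
2018-12-18,BBB,8.0
"""


def variance_by_ticker(csv_file):
    """Return {ticker: sample variance of closing prices} from a csv file object."""
    closes = {}
    for row in csv.DictReader(csv_file):
        # Group closing prices by ticker, converting the text field to a number.
        closes.setdefault(row["ticker"], []).append(float(row["close"]))
    # statistics.variance computes the sample variance (divides by n - 1).
    return {ticker: statistics.variance(prices) for ticker, prices in closes.items()}


result = variance_by_ticker(io.StringIO(SAMPLE_CSV))
print(result)
```

To run the same function on a real file, pass an open file handle instead of the in-memory string, for example `with open("prices.csv", newline="") as f: variance_by_ticker(f)`, where prices.csv is a hypothetical file name.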
Schedule:
 Tuesday, September 3, 2019:
 Course overview
 Introduction to natural language processing (NLP) and machine learning (ML).
 HW 1 assigned (HW 1 data), due 6:00 pm (beginning of class) on Tuesday, September 10, 2019 via email to the instructor.
Slides:
Supplemental:
 Tuesday, September 10, 2019:
 Tuesday, September 17, 2019:
 Natural language processing for quantitative trading.
 Tuesday, September 24, 2019:
 Midterm exam
 Machine learning for quantitative trading.
 Tuesday, October 1, 2019:
 Project midterm reports are due.
 Project midterm presentations and discussion.
 Advanced topics in natural language processing and machine learning.
 Tuesday, October 8, 2019:
 Advanced topics in quantitative trading.
 Tuesday, October 15, 2019:
 NO CLASS (NYU Legislative Day - Classes will meet according to a Monday schedule)
 Tuesday, October 22, 2019:
 Project presentations and discussion.
Supplementary material:
There are many excellent NLP courses taught around the world each year, most with lectures freely available on the internet. If there is a particular topic you would like more background on, or further topics we did not have time to explore in class, I encourage you to take advantage of these resources. As always, if you do reference this material in your work, please cite it.
 Natural Language Processing, Dan Jurafsky and Christopher Manning, Stanford Coursera.
 Natural Language Processing, Jason Eisner, Johns Hopkins (JHU).
 Foundations of Statistical Natural Language Processing, Chris Manning and Hinrich Schütze, MIT Press, Cambridge, MA, May 1999.
 Overfitting / Bias-variance tradeoff, Daniel Geng and Shannon Shih, UC Berkeley.
 Language and Statistics, Roni Rosenfeld, CMU.
 Sentiment Analysis and Opinion Mining (tutorial), Bing Liu, UIC.
 Sentiment Analysis and Opinion Mining (book), Bing Liu, UIC.
 Introduction to Natural Language Processing, David Smith, UMass.
 Natural Language Processing with Deep Learning, Richard Socher, Stanford.
Unfortunately, there are not as many publicly available resources on developing quantitative trading strategies. Nevertheless, there is still a (growing) number of excellent resources, including:
Here are some publicly available datasets:
Homework:
 HW 1 assigned (HW 1 data), due 6:00 pm (beginning of class) on Tuesday, September 10, 2019 via email to the instructor.
Course project:
