Andrew Arnold's




"Man, I love science!"    – Beakman's World


I work on large-scale machine learning at Google. I am also an Adjunct Professor at New York University. Previously, I was a portfolio manager and research director at Cubist Systematic Strategies, applying machine learning to quantitative trading. Before that, I was variously a hedge fund cofounder, CTO, quantitative portfolio manager, machine learning researcher, and software engineer at Ophir Partners, Trexquant, WorldQuant, Merrill Lynch, Microsoft Research, IBM Research, Google, and Bloomberg. You can learn more about my professional history here, along with my academic publications and lectures.


I graduated with a Ph.D. in machine learning from the Machine Learning Department within Carnegie Mellon University's School of Computer Science, under the supervision of William W. Cohen.

My research is generally concerned with machine learning and data mining, with an underlying interest in producing features and models that are robust to changes in the distribution of their underlying data (Thesis proposal, ICDM 2007). To this end, I am particularly interested in transfer learning with an emphasis on domain adaptation.
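One simple and widely used technique in this area, shown here purely as a hedged illustration of the domain-adaptation setting rather than the specific methods in these papers, is feature augmentation: each example's features are copied into a shared block and a domain-specific block, so a single linear model can learn which regularities transfer across domains and which do not.

```python
import numpy as np

def augment(X, domain_idx, n_domains):
    """Feature augmentation for domain adaptation: copy each
    example's features into a shared block plus one block per
    domain, letting one linear model learn both shared and
    domain-specific weights.  X is (n_examples, n_features);
    domain_idx selects which domain-specific block to fill."""
    n, d = X.shape
    out = np.zeros((n, d * (1 + n_domains)))
    out[:, :d] = X                          # shared copy
    start = d * (1 + domain_idx)
    out[:, start:start + d] = X             # domain-specific copy
    return out

# Source-domain examples fill block 0, target-domain block 1; a
# classifier trained on their union sees which features transfer.
source = augment(np.ones((2, 3)), domain_idx=0, n_domains=2)
target = augment(np.ones((2, 3)), domain_idx=1, n_domains=2)
```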

I worked on The Querendipity Project, whose goal was to more accurately integrate and exploit the many heterogeneous sources of information available to a modern scientist. Taking advantage of, among other sources, citation networks (such as CiteSeer), full-text archives (such as PubMed Central), and curated databases (such as the Saccharomyces Genome Database), we were able to help users discover both relevant and novel research related to their interests (ICWSM 2009).

I was also a member of the SLIF team, working on mining text and images together for bioinformatics applications. Our team was one of four finalists in the $50,000 Elsevier Grand Challenge. Specifically, my work dealt with using the text of biological journal articles (e.g. captions, abstracts, and main text) along with their associated images (depicting cells, proteins, graphs, etc.) to better identify entities in both media. Combining these two expressions (text and images) of the same underlying concept (the experiment being performed) into new features that jointly describe both media yields a closer representation of the actual object a user is interested in than disjoint features of text and images alone. A related problem is transfer learning: we took models and named entity extractors trained on one type of data (abstract text, for instance) and adapted them to a related but distinct type of data (caption text) (CIKM 2008, ACL 2008). The intuition is that it is easier to learn a concept once a related concept has already been mastered.
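As a toy illustration of such joint features (a hypothetical sketch, not the actual SLIF feature set), one can concatenate the per-medium feature vectors with their pairwise products, exposing text-image interaction terms to a downstream classifier:

```python
import numpy as np

def joint_features(text_vec, image_vec):
    """Combine text and image features describing the same figure:
    keep each medium's own features and append their outer product,
    so a linear classifier can use text-image interactions."""
    cross = np.outer(text_vec, image_vec).ravel()
    return np.concatenate([text_vec, image_vec, cross])

# 2 text features and 3 image features -> 2 + 3 + 6 joint features.
fv = joint_features(np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0]))
```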

I have also been lucky to pursue related work outside of school during summer internships. While working with Hang Li and Tie-Yan Liu in the Web Search and Mining group at Microsoft Research Asia, we developed novel semi-supervised and transfer-learning-based methods for improving internet search through query-dependent ranking (SIGIR 2008). The idea behind this work is that, regardless of the specific topic users are interested in, there are common features linking certain types of queries together. For instance, users searching for either a person or company name might both be most interested in the corresponding home page (a navigational query), while users searching for a disease or country name might be more interested in authoritative sources of information about these topics (informational queries). By modeling and leveraging these distributions of types of queries, we can better decide what, exactly, users want and deliver that to them.
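A minimal sketch of this idea (with hypothetical query types and scoring functions, not the SIGIR 2008 model itself): first classify the query, then rank results with a scorer chosen for that query type.

```python
def rank(query, results, classify_query, rankers):
    """Query-dependent ranking: predict the query's type, then
    order results with that type's scoring function."""
    score = rankers[classify_query(query)]
    return sorted(results, key=lambda doc: score(query, doc), reverse=True)

# Toy setup: one-word queries are treated as navigational (favor
# home pages), everything else as informational (favor authority).
classify = lambda q: "nav" if len(q.split()) == 1 else "info"
rankers = {
    "nav": lambda q, doc: doc["is_homepage"],
    "info": lambda q, doc: doc["authority"],
}
docs = [
    {"url": "acme.com", "is_homepage": 1, "authority": 0.2},
    {"url": "encyclopedia.org/acme", "is_homepage": 0, "authority": 0.9},
]
nav_order = rank("acme", docs, classify, rankers)
info_order = rank("acme history overview", docs, classify, rankers)
```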

Relatedly, while in the Data Analytics group at IBM Research Watson, I worked with Naoki Abe and Yan Liu on methods for learning causal models from temporally ordered data (KDD 2007). We felt that the interpretability offered by a causal model was quite valuable for the end user in understanding the process being studied. This type of understanding is an essential component of the scientific process since it leads the researcher to an idea of what experiment to perform next. An accurate predictive model, without interpretation, provides little insight as to what direction is best to pursue. This was also the motivation behind my work with Richard Scheines and Joseph E. Beck on discovering predictive, semantically and scientifically interpretable high-level features as functions of raw, event level data (AAAI 2006, 2005).
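The flavor of learning from temporally ordered data can be sketched with a toy Granger-style test (an illustration of the general idea, not the specific KDD 2007 method): a series x is a candidate cause of y if adding x's past values reduces the error of predicting y from its own past.

```python
import numpy as np

def granger_improvement(x, y, lag=1):
    """How much does adding the lagged series x reduce the
    least-squares error of predicting y from its own past?
    A large positive value suggests x helps forecast y."""
    Y = y[lag:]
    base = np.column_stack([y[:-lag], np.ones(len(Y))])
    full = np.column_stack([y[:-lag], x[:-lag], np.ones(len(Y))])
    r_base = np.linalg.lstsq(base, Y, rcond=None)[1][0]
    r_full = np.linalg.lstsq(full, Y, rcond=None)[1][0]
    return float(r_base - r_full)

# y is driven by the previous value of x, so the improvement in the
# causal direction should dominate the reverse direction.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.concatenate([[0.0], 0.9 * x[:-1]]) + 0.1 * rng.normal(size=300)
```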

I did my undergraduate work in the Intrusion Detection System Group within the Computer Science Department of Columbia University, under the supervision of Professor Salvatore J. Stolfo and Eleazar Eskin. My work there dealt with applying kernel methods and support vector machines to the problem of clustering data (binaries, system calls, network packets, etc.) in order to identify possible attacks (DMSA 2002).
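The kernel-clustering flavor of that work can be sketched as follows (a hypothetical simplification, not the DMSA 2002 system): score each point by its distance to the data centroid in the RBF kernel's feature space, and flag far-away points as candidate attacks.

```python
import numpy as np

def anomaly_scores(X, gamma=0.5):
    """Score each row of X by its squared distance to the data
    centroid in RBF feature space.  Via the kernel trick,
    ||phi(x) - mu||^2 = k(x, x) - (2/n) * sum_i k(x, x_i) + const,
    where the constant is shared by all points and dropped."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                 # pairwise RBF kernel
    return 1.0 - 2.0 / len(X) * K.sum(axis=1)

# A tight cluster plus one far-away point: the outlier should get
# the highest score.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = anomaly_scores(X)
```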

In my spare time I work on applying machine learning techniques towards opponent modeling for Texas Hold 'em poker and Tic-Tac-Toe.










Email:  andrew . arnold @ gmail . com