Text as Data

Spring 2016
Lecture: Wednesdays 5:10 – 7:00 pm in Warren Weaver Hall 109
Instructor: Prof. Arthur Spirling, arthur.spirling@nyu.edu
Section: Thursdays 6:10 – 7:00 pm in SILV 207
Teaching Assistant: Kevin Munger, km2713@nyu.edu


At the very least, students should have a first class in statistics and/or inference under their belt
before taking this course. In particular, basic knowledge of calculus, probability, densities,
distributions, statistical tests, hypothesis testing, the linear model, maximum likelihood and generalized
linear models is assumed. The core language and software environment of this course is
R. If you are not familiar with R, you will struggle with the assigned exercises. Please check with
the instructor if you are unclear as to whether you are qualified for this course.


The availability of text data has exploded in recent times, and so has the demand for analysis of
that data. This course introduces students to the quantitative analysis of text from a social science
perspective, with a special focus on politics. The course is applied in nature, and while we will give
some theoretical treatment of the topics at hand, the primary aim is to help students understand the
types of questions we can ask with text, and how to go about answering them. With that in mind,
we first explain how texts may be modeled as quantitative entities and discuss how they might be
compared. We then move to both supervised and unsupervised techniques in some detail, before
dealing with some ‘special topics’ that arise in particular lines of social science research. Ultimately,
the goal is to help students conduct their own text-as-data research projects, and this class provides
the foundations on which more focused, technical research can be built.

While many of the techniques we discuss have their origins in computer science or statistics, this
is not a CS class: we will spend relatively little time on traditional Natural Language Processing
issues (such as machine translation, optical character recognition, part-of-speech tagging, etc.).
Other offerings in the university cover those matters more than adequately. Similarly, this class
will not deal much with obtaining text data: again, there are excellent classes elsewhere dealing
with e.g. web-scraping.


This course meets once a week with the instructing professor (two 50-minute lectures),
plus a 50-minute section with the TA. Enrolled students must attend all meetings. The information
and skills that you need to complete your homework assignments and term projects will be provided
by the Professor or the TA. Generally speaking, the first lecture of a given session will deal with
theoretical/technical/modeling issues, while the second will be more applied in nature.


There are no written exams in the class, and your grade will be based on a combination of:
  • Homeworks (50%): There will be (at least) three homeworks, all of which will involve modeling and coding of text data, and some theoretical work. Intellectual honesty is important at NYU: you may confer with colleagues, but all work on the homework must be your own. If you copy work or allow another student to copy your work, the homework will be graded zero and your case will be passed to appropriate authorities in the university.
  • Final Paper (50%): There will be a final written paper of no more than 12 double-spaced pages of text, which explores an original research project or idea. This may be substantive or technical in nature. You are encouraged to work in teams of up to two people on this paper.
The deadline for the paper will be May 11, 2016, with no extensions or exceptions.