Text as Data

A graduate course on computational text analysis methods for social science research, covering topic modeling, word embeddings, sentiment analysis, and large language models.

Instructor: Zhicong Chen
Term: Spring 2025
Location: Nanjing University

Course Overview

This course introduces graduate students to the computational text analysis toolkit used in communication and social science research. Students gain hands-on experience transforming unstructured text into structured data, applying classical NLP methods and large language models.

Learning Objectives

By the end of this course, students will be able to:

  • Preprocess and represent text corpora for computational analysis
  • Apply sentiment analysis, topic modelling, and word embeddings to communication research questions
  • Fine-tune or prompt large language models for annotation and classification tasks
  • Evaluate the validity and reliability of text-based measurements
  • Design and execute an original computational text analysis study

Prerequisites

  • Proficiency in Python and pandas (Introduction to Python or equivalent)
  • Basic familiarity with social science research methods

Level

Graduate

Institution

School of Journalism and Communication, Nanjing University

Offered

Spring 2024, Spring 2025

Required Text

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Key Python Libraries

Library                    Purpose                       Link
NLTK                       Classic NLP toolkit           nltk.org
spaCy                      Industrial-strength NLP       spacy.io
Gensim                     Topic models & word vectors   radimrehurek.com/gensim
scikit-learn               ML classifiers & vectorisers  scikit-learn.org
Hugging Face Transformers  BERT, GPT, and more           huggingface.co
BERTopic                   Neural topic modelling        maartengr.github.io/BERTopic
VADER                      Rule-based sentiment          github.com/cjhutto/vaderSentiment

Useful Datasets

Assessment

Component               Weight
Weekly coding labs      35%
Midterm method paper    20%
Final research project  40%
Participation           5%

Schedule

Week 1. Introduction: Why Text as Data?

Overview of computational text analysis in social science; the bag-of-words assumption; units of analysis.

Week 2. Text Preprocessing

Tokenisation, lowercasing, stopword removal, stemming vs. lemmatisation, regex.
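The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the course's lab code: the stopword list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer shipped with NLTK.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on"}

# Regex tokeniser: runs of letters, optionally followed by an apostrophe part.
TOKEN_PATTERN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def naive_stem(token: str) -> str:
    """Strip a few common English suffixes (hypothetical stand-in for a stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, stem: bool = False) -> list[str]:
    """Lowercase, tokenise with a regex, drop stopwords, optionally stem."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens


example = "The senators are debating the new bills on Tuesday."
tokens = preprocess(example)
stems = preprocess(example, stem=True)
print(tokens)
print(stems)
```

Stemming produces truncated forms such as "debat", which need not be dictionary words; lemmatisation would instead map "debating" to a dictionary form, at the cost of needing a vocabulary or a tagger.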

Week 3. Representing Text: Bag-of-Words and TF-IDF

Document-term matrices, TF-IDF weighting, cosine similarity, feature selection.
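As a rough sketch of these ideas, the following builds a document-term matrix, applies TF-IDF weights, and compares documents by cosine similarity using only the standard library. The three-sentence corpus is hypothetical, and the plain log(N / df) form of IDF is an assumption made for readability; scikit-learn's TfidfVectorizer uses a smoothed variant by default, so its numbers would differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus.
docs = [
    "the economy is growing",
    "the economy is shrinking",
    "the team won the match",
]

tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Document-term matrix of raw counts: one row per document, one column per term.
dtm = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Inverse document frequency in the plain log(N / df) form.
n_docs = len(docs)
doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
idf = [math.log(n_docs / df) for df in doc_freq]

# TF-IDF: raw term frequency times IDF.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in dtm]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


sim_01 = cosine(tfidf[0], tfidf[1])  # the two economy sentences
sim_02 = cosine(tfidf[0], tfidf[2])  # economy vs. sport
print(round(sim_01, 3), sim_02)
```

Because "the" occurs in every document, its IDF is zero, so the first and third documents share no weighted terms even though they share a word.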

Week 4. Dictionary Methods and Sentiment Analysis

Lexicon-based approaches (LIWC, VADER, SentiWordNet); counting and scaling; validation.
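The counting-and-scaling logic behind dictionary methods can be illustrated as follows. The lexicon here is a hypothetical six-word list invented for this sketch; it is not drawn from LIWC, VADER, or SentiWordNet, and dividing the net score by document length is only one of several possible scaling choices.

```python
import re

# Hypothetical mini-lexicon for illustration only.
LEXICON = {
    "good": 1, "great": 1, "happy": 1,
    "bad": -1, "terrible": -1, "sad": -1,
}


def dictionary_score(text):
    """Count lexicon matches and scale the net score by document length."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    n_tokens = len(tokens)
    net = sum(hits)
    return {
        "n_tokens": n_tokens,
        "n_matches": len(hits),
        "net_score": net,
        # Scaling by length keeps long and short documents comparable.
        "scaled": net / n_tokens if n_tokens else 0.0,
    }


# Hypothetical example posts.
posts = [
    "Great game, the fans are happy",
    "Terrible service and a bad meal",
    "The meeting is on Monday",
]
results = [dictionary_score(p) for p in posts]
for post, result in zip(posts, results):
    print(post, "->", result)
```

The third post matches nothing in the lexicon, which is a reminder that a score of zero can mean "neutral" or simply "not covered by the dictionary"; distinguishing the two is part of validation.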

Week 5. Topic Modeling: LDA

Latent Dirichlet Allocation; hyperparameters (alpha, beta, K); interpreting topics; coherence metrics.

Week 6. Topic Modeling: BERTopic and CTM

Neural topic models; BERTopic pipeline; comparing LDA vs. neural approaches.

Week 7. Word Embeddings: Word2Vec and GloVe

Distributional semantics; skip-gram and CBOW; analogies; GloVe pre-trained vectors.

Week 8. Contextualized Embeddings: BERT

Transformers overview; BERT architecture; fine-tuning for text classification.

Week 9. Text Classification

Naive Bayes, logistic regression, SVMs, transformer classifiers; train/test split; cross-validation.
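To make the first of these classifiers concrete, here is a from-scratch sketch of multinomial Naive Bayes with add-one smoothing, evaluated on a held-out test set. The function names, the four training headlines, and the two test headlines are all hypothetical; in practice one would more likely use scikit-learn's MultinomialNB together with its cross-validation utilities.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with add-alpha (Laplace) smoothing."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def predict_nb(model, doc):
    """Return the label with the highest log posterior for one document."""
    n_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    scores = {}
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / n_docs)  # log prior
        for word in doc.lower().split():
            if word in model["vocab"]:  # skip words never seen in training
                score += math.log((counts[word] + alpha) / (total + alpha * vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training and held-out test data.
train_docs = [
    "tax cuts boost the economy",
    "parliament votes on the budget",
    "striker scores in the final",
    "coach praises the team defence",
]
train_labels = ["politics", "politics", "sports", "sports"]
test_docs = ["budget vote in parliament", "team wins the final"]
test_labels = ["politics", "sports"]

model = train_nb(train_docs, train_labels)
predictions = [predict_nb(model, d) for d in test_docs]
accuracy = sum(p == g for p, g in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Accuracy on two test documents says very little; the point of the split is only that the model is scored on text it did not see during training.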

Week 10. Large Language Models for Research

GPT-4 / Claude API as annotation tools; prompt engineering; zero-shot and few-shot classification.
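The difference between zero-shot and few-shot classification lies in the prompt, which can be sketched without calling any model. The helpers `build_prompt` and `parse_label`, the prompt wording, and the example texts are all hypothetical; sending the prompt to GPT-4 or Claude is deliberately left out because it depends on the provider's SDK and credentials.

```python
def build_prompt(text, labels, examples=()):
    """Assemble a classification prompt; with no examples it is zero-shot."""
    lines = [
        "Classify the text into exactly one of these categories: "
        + ", ".join(labels) + ".",
        "Answer with the category name only.",
        "",
    ]
    # Few-shot: prepend labelled demonstrations in the same format as the query.
    for example_text, example_label in examples:
        lines += [f"Text: {example_text}", f"Category: {example_label}", ""]
    lines += [f"Text: {text}", "Category:"]
    return "\n".join(lines)


def parse_label(response, labels):
    """Map a raw model reply to a known label, or None if it does not match."""
    cleaned = response.strip().strip(".").lower()
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


labels = ["positive", "negative", "neutral"]
query = "The new policy is a disaster for commuters."

zero_shot = build_prompt(query, labels)
few_shot = build_prompt(
    query,
    labels,
    examples=[
        ("I love the new library opening hours.", "positive"),
        ("The council meets on Thursday.", "neutral"),
    ],
)
print(few_shot)
```

Parsing replies strictly, as `parse_label` does, makes off-format answers visible as missing values instead of silently coercing them into a category.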

Week 11. Named Entity Recognition and Information Extraction

NER with spaCy and Hugging Face; relation extraction; event detection.

Week 12. Validation and Measurement

Inter-rater reliability (Cohen’s kappa), precision/recall/F1, construct validity, replication.
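Two of these measures are simple enough to compute by hand, as the sketch below shows. The two coders' labels are hypothetical, and the second half treats coder A as the gold standard and coder B as the prediction purely for illustration; scikit-learn offers equivalent functions in `sklearn.metrics`.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical labels from two human coders on ten documents.
coder_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
coder_b = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg", "pos", "neg"]

kappa = cohens_kappa(coder_a, coder_b)
precision, recall, f1 = precision_recall_f1(coder_a, coder_b, positive="pos")
print(round(kappa, 2), round(precision, 2), round(recall, 2), round(f1, 2))
```

Here the coders agree on 8 of 10 documents, but because chance agreement is 0.5, kappa comes out at about 0.6, noticeably lower than the raw agreement rate.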

Week 13. Final Project Presentations

Students present computational text analysis projects on communication data.