Text as Data | Zhicong Chen, Ph.D.

Instructor: Zhicong Chen

Term: Spring 2025

Location: Nanjing University

Course Overview

This course introduces graduate students to the computational text analysis toolkit used in communication and social science research. Students gain hands-on experience transforming unstructured text into structured data, using both classical NLP methods and large language models.

Learning Objectives

By the end of this course, students will be able to:

Preprocess and represent text corpora for computational analysis
Apply sentiment analysis, topic modelling, and word embeddings to communication research questions
Fine-tune or prompt large language models for annotation and classification tasks
Evaluate the validity and reliability of text-based measurements
Design and execute an original computational text analysis study

Prerequisites

Proficiency in Python and pandas (Introduction to Python or equivalent)
Basic familiarity with social science research methods

Level

Graduate

Institution

School of Journalism and Communication, Nanjing University

Offered

Spring 2024, Spring 2025

Required Text

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.

Key Python Libraries

Library	Purpose	Link
NLTK	Classic NLP toolkit	nltk.org
spaCy	Industrial-strength NLP	spacy.io
Gensim	Topic models & word vectors	radimrehurek.com/gensim
scikit-learn	ML classifiers & vectorisers	scikit-learn.org
Hugging Face Transformers	BERT, GPT, and more	huggingface.co
BERTopic	Neural topic modelling	maartengr.github.io/BERTopic
VADER	Rule-based sentiment	github.com/cjhutto/vaderSentiment
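
As a rough illustration of what the vectorisers listed above compute, the sketch below builds TF-IDF vectors and compares documents by cosine similarity (both Week 3 topics) using only the Python standard library. The three example sentences, the whitespace tokeniser, and the helper names (tokenize, tfidf_vectors, cosine_similarity) are assumptions made for illustration. The IDF formula is one common smoothed variant, not necessarily the exact weighting any particular library applies by default.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokeniser)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, list of TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    vocab = sorted(df)
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mini-corpus: two political sentences and one about sport.
docs = [
    "the senate passed the budget bill",
    "the house debated the budget bill",
    "fans cheered as the team won the match",
]
vocab, vectors = tfidf_vectors(docs)
sim_politics = cosine_similarity(vectors[0], vectors[1])
sim_mixed = cosine_similarity(vectors[0], vectors[2])
print(f"doc0 vs doc1: {sim_politics:.3f}")
print(f"doc0 vs doc2: {sim_mixed:.3f}")
```

The two political sentences should score as more similar to each other than either does to the sports sentence. In the labs, a library vectoriser would normally replace these hand-written helpers.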

Useful Datasets

Hugging Face Datasets Hub — thousands of NLP datasets
GDELT Project — global news events
Pushshift Reddit Archive

Assessment

Component	Weight
Weekly coding labs	35%
Midterm method paper	20%
Final research project	40%
Participation	5%

Schedule

Week	Date	Topic	Materials
1	Week 1	Introduction: Why Text as Data? Overview of computational text analysis in social science; the bag-of-words assumption; units of analysis.	Grimmer et al., Text as Data (Princeton UP); Computational Social Science (Lazer et al., 2020)
2	Week 2	Text Preprocessing. Tokenisation, lowercasing, stopword removal, stemming vs. lemmatisation, regex.	NLTK Book, Chapter 3; spaCy 101; spaCy Download
3	Week 3	Representing Text: Bag-of-Words and TF-IDF. Document-term matrices, TF-IDF weighting, cosine similarity, feature selection.	scikit-learn: Text Feature Extraction; TF-IDF Explained
4	Week 4	Dictionary Methods and Sentiment Analysis. Lexicon-based approaches (LIWC, VADER, SentiWordNet); counting and scaling; validation.	VADER Sentiment (Python); LIWC; TextBlob
5	Week 5	Topic Modelling: LDA. Latent Dirichlet Allocation; hyperparameters (alpha, beta, K); interpreting topics; coherence metrics.	Gensim LDA Tutorial; LDA Visualization (pyLDAvis); Blei et al. 2003, Original LDA Paper
6	Week 6	Topic Modelling: BERTopic and CTM. Neural topic models; BERTopic pipeline; comparing LDA vs. neural approaches.	BERTopic Documentation; BERTopic Paper
7	Week 7	Word Embeddings: Word2Vec and GloVe. Distributional semantics; skip-gram and CBOW; analogies; GloVe pre-trained vectors.	Gensim Word2Vec Tutorial; GloVe Vectors (Stanford NLP); Word2Vec Illustrated
8	Week 8	Contextualised Embeddings: BERT. Transformers overview; BERT architecture; fine-tuning for text classification.	Hugging Face Transformers Docs; The Illustrated BERT (Jay Alammar); BERT Paper
9	Week 9	Text Classification. Naive Bayes, logistic regression, SVMs, transformer classifiers; train/test split; cross-validation.	scikit-learn Text Classification Tutorial; Hugging Face Text Classification
10	Week 10	Large Language Models for Research. GPT-4 / Claude API as annotation tools; prompt engineering; zero-shot and few-shot classification.	OpenAI API Documentation; Anthropic Claude API; Ziems et al. 2024, Can LLMs replace human annotators?
11	Week 11	Named Entity Recognition and Information Extraction. NER with spaCy and Hugging Face; relation extraction; event detection.	spaCy NER Docs; Hugging Face NER
12	Week 12	Validation and Measurement. Inter-rater reliability (Cohen’s kappa), precision/recall/F1, construct validity, replication.	Krippendorff's Alpha; Guidelines for Human Annotation (Artstein & Poesio)
13	Week 13	Final Project Presentations. Students present computational text analysis projects on communication data.
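
The Week 12 measures can be made concrete with a small worked example. The sketch below computes Cohen's kappa (chance-corrected agreement between two annotators) and precision, recall, and F1 for one class, using only the standard library. The label lists and the function names (cohens_kappa, precision_recall_f1) are hypothetical and chosen for illustration; in practice a library implementation would usually be preferred.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if both annotators labelled independently,
    # each following their own marginal label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical sentiment labels: a human coder vs. an automated classifier.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
model = ["pos", "neg", "neg", "neg", "pos", "pos", "neg", "pos"]

kappa = cohens_kappa(human, model)
precision, recall, f1 = precision_recall_f1(human, model, positive="pos")
print(f"kappa={kappa:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here the two label sets agree on 6 of 8 items (0.75), while chance agreement is 0.5, so kappa is 0.5. Raw agreement alone would overstate how reliable the measurement is.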