Text as Data
A graduate course on computational text analysis methods for social science research, covering topic modeling, word embeddings, sentiment analysis, and large language models.
Course Overview
This course introduces graduate students to the computational text analysis toolkit used in communication and social science research. Students gain hands-on experience transforming unstructured text into structured data, using both classical NLP methods and large language models.
Learning Objectives
By the end of this course, students will be able to:
- Preprocess and represent text corpora for computational analysis
- Apply sentiment analysis, topic modeling, and word embeddings to communication research questions
- Fine-tune or prompt large language models for annotation and classification tasks
- Evaluate the validity and reliability of text-based measurements
- Design and execute an original computational text analysis study
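As a rough illustration of the first objective, the sketch below preprocesses three toy documents, represents them as TF-IDF vectors, and compares them with cosine similarity. It uses only the Python standard library. The stopword list, the example sentences, and the function names `tokenize`, `tfidf_vectors`, and `cosine` are illustrative assumptions, not course materials. The weighting is the plain tf × log(N/df) variant; libraries such as scikit-learn apply smoothing and normalisation by default, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real projects use a fuller list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical sentences, chosen only to show the effect).
docs = [
    "The government announced a new tax policy.",
    "A new tax policy is debated in parliament.",
    "The football team won the match.",
]
vecs = tfidf_vectors(docs)
sim_politics = cosine(vecs[0], vecs[1])   # documents sharing "new tax policy"
sim_unrelated = cosine(vecs[0], vecs[2])  # documents with no shared terms
print(f"politics pair: {sim_politics:.3f}, unrelated pair: {sim_unrelated:.3f}")
```

The two policy sentences share three terms and so receive a positive similarity, while the football sentence shares none with the first and scores zero. Terms that appear in every document would get a weight of zero under this unsmoothed formula, which is one reason production libraries smooth the idf term.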
Prerequisites
- Proficiency in Python and pandas (Introduction to Python or equivalent)
- Basic familiarity with social science research methods
Level
Graduate
Institution
School of Journalism and Communication, Nanjing University
Offered
Spring 2024, Spring 2025
Required Text
Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.
Key Python Libraries
| Library | Purpose | Link |
|---|---|---|
| NLTK | Classic NLP toolkit | nltk.org |
| spaCy | Industrial-strength NLP | spacy.io |
| Gensim | Topic models & word vectors | radimrehurek.com/gensim |
| scikit-learn | ML classifiers & vectorisers | scikit-learn.org |
| Hugging Face Transformers | BERT, GPT, and more | huggingface.co |
| BERTopic | Neural topic modeling | maartengr.github.io/BERTopic |
| VADER | Rule-based sentiment | github.com/cjhutto/vaderSentiment |
Useful Datasets
- Hugging Face Datasets Hub: thousands of NLP datasets
- GDELT Project: global news events
- Pushshift Reddit Archive: historical Reddit submissions and comments
Assessment
| Component | Weight |
|---|---|
| Weekly coding labs | 35% |
| Midterm method paper | 20% |
| Final research project | 40% |
| Participation | 5% |
Schedule
| Week | Date | Topic | Materials |
|---|---|---|---|
| 1 | Week 1 | Introduction: Why Text as Data? Overview of computational text analysis in social science; the bag-of-words assumption; units of analysis. | |
| 2 | Week 2 | Text Preprocessing. Tokenisation, lowercasing, stopword removal, stemming vs. lemmatisation, regex. | |
| 3 | Week 3 | Representing Text: Bag-of-Words and TF-IDF. Document-term matrices, TF-IDF weighting, cosine similarity, feature selection. | |
| 4 | Week 4 | Dictionary Methods and Sentiment Analysis. Lexicon-based approaches (LIWC, VADER, SentiWordNet); counting and scaling; validation. | |
| 5 | Week 5 | Topic Modeling: LDA. Latent Dirichlet Allocation; hyperparameters (alpha, beta, K); interpreting topics; coherence metrics. | |
| 6 | Week 6 | Topic Modeling: BERTopic and CTM. Neural topic models; BERTopic pipeline; comparing LDA vs. neural approaches. | |
| 7 | Week 7 | Word Embeddings: Word2Vec and GloVe. Distributional semantics; skip-gram and CBOW; analogies; GloVe pre-trained vectors. | |
| 8 | Week 8 | Contextualized Embeddings: BERT. Transformers overview; BERT architecture; fine-tuning for text classification. | |
| 9 | Week 9 | Text Classification. Naive Bayes, logistic regression, SVMs, transformer classifiers; train/test split; cross-validation. | |
| 10 | Week 10 | Large Language Models for Research. GPT-4 / Claude API as annotation tools; prompt engineering; zero-shot and few-shot classification. | |
| 11 | Week 11 | Named Entity Recognition and Information Extraction. NER with spaCy and Hugging Face; relation extraction; event detection. | |
| 12 | Week 12 | Validation and Measurement. Inter-rater reliability (Cohen’s kappa), precision/recall/F1, construct validity, replication. | |
| 13 | Week 13 | Final Project Presentations. Students present computational text analysis projects on communication data. | |
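The Week 12 measures can be sketched in a few lines. The example below is a minimal, standard-library illustration of Cohen's kappa and precision/recall/F1, assuming two label sequences over the same items (for instance, a human coder and a classifier). The function names `cohens_kappa` and `precision_recall_f1` and the ten toy labels are hypothetical and not taken from the course materials.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled at random with their own marginals.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class, treating `gold` as ground truth."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): a human coder versus a model on ten documents.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
model = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]

kappa = cohens_kappa(human, model)
precision, recall, f1 = precision_recall_f1(human, model, positive="pos")
print(f"kappa={kappa:.3f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here the two sequences agree on 8 of 10 items, but because chance agreement is 0.52 given the label frequencies, kappa comes out near 0.58, noticeably lower than raw agreement. Kappa treats both raters symmetrically, whereas precision and recall assume one sequence is the reference standard.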