A while ago I came across an online tutorial on time series forecasting with LSTMs, built with the Keras framework. The main goal of this post is to work through that whole process and rewrite it with PyTorch. Before this, I had only installed the TensorFlow and PyTorch environments (CPU-only at that) and run one or two Getting Started tutorials from the official sites, so this was very much a start from scratch.

The original post is Multivariate Time Series Forecasting with LSTMs in Keras. There is also a related article, likewise built with Keras: LSTM Neural Network for Time Series Prediction, whose source code is available on GitHub.

Let's dissect the whole process step by step.

Data Preparation

First, data preparation. The original post uses an air quality monitoring dataset with the following attributes:

  • No: row number
  • year: year of data in this row
  • month: month of data in this row
  • day: day of data in this row
  • hour: hour of data in this row
  • pm2.5: PM2.5 concentration
  • DEWP: Dew Point
  • TEMP: Temperature
  • PRES: Pressure
  • cbwd: Combined wind direction
  • Iws: Cumulated wind speed
  • Is: Cumulated hours of snow
  • Ir: Cumulated hours of rain

The original DataFrame looks like this:

No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11,1021,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12,1020,NW,4.92,0,0
……

The date fields are merged into a single column with pandas:

from datetime import datetime
from pandas import read_csv

def parse(x):
    return datetime.strptime(x, '%Y %m %d %H')

dataset = read_csv('raw.csv', parse_dates=[['year', 'month', 'day', 'hour']], index_col=0, date_parser=parse)

This yields a new DataFrame:

date,pollution,dew,temp,press,wnd_dir,wnd_spd,snow,rain
2010-01-02 00:00:00,129.0,-16,-4.0,1020.0,SE,1.79,0,0
2010-01-02 01:00:00,148.0,-15,-4.0,1020.0,SE,2.68,0,0
2010-01-02 02:00:00,159.0,-11,-5.0,1021.0,SE,3.57,0,0
2010-01-02 03:00:00,181.0,-7,-5.0,1022.0,SE,5.36,1,0
2010-01-02 04:00:00,138.0,-7,-5.0,1022.0,SE,6.25,2,0
……

The next step is to turn the time series into data suitable for supervised learning; see this article for the idea.

The code below defines a function series_to_supervised that converts the original time series into a supervised learning dataset. Before calling it, two scikit-learn classes are used for preprocessing: LabelEncoder turns the non-numeric feature (wind direction, wnd_dir) into integer codes starting from 0, and MinMaxScaler then rescales the whole dataset.

# convert series to supervised learning
from pandas import DataFrame, concat

def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    n_vars = 1 if type(data) is list else data.shape[1]
    df = DataFrame(data)
    cols, names = list(), list()
    # input sequence (t-n, ... t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [('var%d(t-%d)' % (j+1, i)) for j in range(n_vars)]
    # forecast sequence (t, t+1, ... t+n)
    for i in range(0, n_out):
        cols.append(df.shift(-i))
        if i == 0:
            names += [('var%d(t)' % (j+1)) for j in range(n_vars)]
        else:
            names += [('var%d(t+%d)' % (j+1, i)) for j in range(n_vars)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg

# load dataset
values = dataset.values
# integer encode direction
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
values[:, 4] = encoder.fit_transform(values[:, 4])
# ensure all data is float
values = values.astype('float32')
# normalize features
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
# frame as supervised learning
reframed = series_to_supervised(scaled, 1, 1)
# drop columns we don't want to predict
reframed.drop(reframed.columns[[9, 10, 11, 12, 13, 14, 15]], axis=1, inplace=True)
print(reframed.head())

The resulting DataFrame looks like this. There are 8 features in total, used as X, and the target Y is the first feature, i.e. the PM2.5 pollution level. Since this is time series forecasting, Y can itself be one of the features in X; we simply use the values at time t-1 to predict the value at time t.

var1(t-1) var2(t-1) var3(t-1) var4(t-1) var5(t-1) var6(t-1) \
1 0.129779 0.352941 0.245902 0.527273 0.666667 0.002290
2 0.148893 0.367647 0.245902 0.527273 0.666667 0.003811
3 0.159960 0.426471 0.229508 0.545454 0.666667 0.005332
4 0.182093 0.485294 0.229508 0.563637 0.666667 0.008391
5 0.138833 0.485294 0.229508 0.563637 0.666667 0.009912
var7(t-1) var8(t-1) var1(t)
1 0.000000 0.0 0.148893
2 0.000000 0.0 0.159960
3 0.000000 0.0 0.182093
4 0.037037 0.0 0.138833
5 0.074074 0.0 0.109658
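
The training code further below uses train_X, train_y, test_X and test_y, which are not constructed in the snippets above. Here is a minimal sketch of that step, assuming the same split as the original Keras tutorial (first year of hourly data for training, the rest for testing); the exact shapes in the linked notebook may differ.

import torch
from torch.autograd import Variable

values = reframed.values              # float32, 9 columns: 8 lagged features + var1(t)
n_train_hours = 365 * 24
train, test = values[:n_train_hours], values[n_train_hours:]

train_X, train_y = train[:, :-1], train[:, -1]   # last column is the target var1(t)
test_X, test_y = test[:, :-1], test[:, -1]

# wrap as Variables (old PyTorch <= 0.3 style, matching the code below);
# the Sequence model chunks the 8 columns of train_X along dim=1, one feature per step,
# so the target shapes may need adjusting to match the model's output.
train_X = Variable(torch.from_numpy(train_X), requires_grad=False)
train_y = Variable(torch.from_numpy(train_y), requires_grad=False)
test_X = Variable(torch.from_numpy(test_X), requires_grad=False)
test_y = Variable(torch.from_numpy(test_y), requires_grad=False)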

Building the LSTM Network

For an introduction to the LSTM model, see this post: Understanding LSTM Networks (Chinese translation, 理解LSTM网络(译)).

In an LSTM, each cell maintains a hidden state and a cell state, denoted h and c. For each input to the cell, a set of functions, somewhat like the "gates" in digital circuits, implements behaviors such as "forgetting". These functions are already wrapped up by deep learning frameworks like PyTorch, so all we need to do is define h and c. In the original post the author builds the network with Keras, defining a hidden layer of 50 neurons (my understanding is that the hidden state has 50 features), followed by a Dense layer that maps the hidden layer's result to a single output value.

# design network
model = Sequential()
model.add(LSTM(50, input_shape=(train_X.shape[1], train_X.shape[2])))
model.add(Dense(1))

In PyTorch, the network can be defined as follows: build a Sequence module made of two LSTMCells, initialize h0 and c0 to zeros, and feed the input through the two cells step by step.

import torch
import torch.nn as nn
from torch.autograd import Variable

class Sequence(nn.Module):
    def __init__(self):
        super(Sequence, self).__init__()
        # the hidden_size is 51
        self.lstm1 = nn.LSTMCell(1, 51)
        self.lstm2 = nn.LSTMCell(51, 1)

    def forward(self, input, future=0):
        outputs = []
        # the states (h_t, c_t) and (h_t2, c_t2) are initialized to zeros
        h_t = Variable(
            torch.zeros(input.size(0), 51), requires_grad=False)
        c_t = Variable(
            torch.zeros(input.size(0), 51), requires_grad=False)
        h_t2 = Variable(
            torch.zeros(input.size(0), 1), requires_grad=False)
        c_t2 = Variable(
            torch.zeros(input.size(0), 1), requires_grad=False)
        # feed the lagged features one at a time (one feature per step)
        for i, input_t in enumerate(input.chunk(input.size(1), dim=1)):
            h_t, c_t = self.lstm1(input_t, (h_t, c_t))
            h_t2, c_t2 = self.lstm2(c_t, (h_t2, c_t2))
            outputs += [c_t2]
        outputs = torch.stack(outputs, 1).squeeze(2)
        return outputs
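
For comparison, here is a sketch that mirrors the Keras model above more directly (one LSTM layer with 50 hidden units followed by a linear output layer). This is an alternative I wrote for illustration, not the code used in the linked notebook, and it assumes a 3D input of shape (batch, timesteps, features).

import torch.nn as nn

class KerasLikeLSTM(nn.Module):
    """Rough PyTorch analogue of Sequential([LSTM(50), Dense(1)])."""
    def __init__(self, n_features, hidden_size=50):
        super(KerasLikeLSTM, self).__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, timesteps, n_features)
        out, _ = self.lstm(x)            # out: (batch, timesteps, hidden_size)
        return self.fc(out[:, -1, :])    # predict from the last timestep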

For this part I referred to three other documents.

Model Training

To train such a network, we define a loss function and an optimizer, and then follow the usual training template.

import torch.optim as optim

# build the model
seq = Sequence()
criterion = nn.MSELoss()
# use LBFGS as optimizer since we can load the whole data to train
optimizer = optim.LBFGS(seq.parameters())
loss_list = []
test_loss_list = []
epoch_num = 50
# begin to train
for epoch in range(epoch_num):
    print('epoch : ', epoch)

    def closure():
        optimizer.zero_grad()
        out = seq(train_X)
        loss = criterion(out, train_y)
        # print('loss:', loss.data.numpy()[0])
        loss_list.append(loss.data.numpy()[0])
        loss.backward()
        return loss

    optimizer.step(closure)
    pred = seq(test_X)
    loss = criterion(pred, test_y)
    # print('test loss:', loss.data.numpy()[0])
    test_loss_list.append(loss.data.numpy()[0])
    y = pred.data.numpy()

The full code is available here: http://nbviewer.jupyter.org/github/zhicongchen/ml-beginners/blob/master/Multivariate%20Time%20Series%20Forecasting%20with%20LSTMs%20in%20PyTorch.ipynb

Introduction

Original repository: https://github.com/hjptriplebee/Chinese_poem_generator

I forked a simple poem-writing bot written by an undergraduate classmate. The training data are 30,000 Tang poems scraped from the web, the model is an LSTM (RNN), and the idea is similar to Word2Vec. For details, see his CSDN blog: http://blog.csdn.net/accepthjp/article/details/73875108

Environment

  • Mac OS X
  • Python 2.7 on Anaconda
  • TensorFlow 1.0

The original project targets Python 3, but my machine has Anaconda with Python 2.7, so I simply ran it under 2.7. Debugging it gave me a good excuse to read the code carefully, which was a learning experience in itself.

Usage

"python main.py -m {train, test, head}" train训练, test随机写诗, head藏头诗.

Notes

I had never touched deep learning or the TensorFlow framework before, and had only read parts of Zhou Zhihua's Machine Learning (《机器学习》), so both the environment setup and the learning started from zero.

First, install TensorFlow: https://www.tensorflow.org/install/install_mac

I used the native pip method:

pip install tensorflow

I also read Getting Started with TensorFlow (https://www.tensorflow.org/get_started/get_started) to get a rough idea of how the framework is used; the tutorial walks through a linear model.

Next I cloned the project and ran it directly with python. Line 57 of model.py raised an error, presumably because the old checkpoint could not be read; deleting everything under the checkpoint/ directory solved it.

saver.restore(sess, checkPoint.model_checkpoint_path)

The error is shown in this screenshot: 屏幕快照 2017-08-11 下午1.01.50.png

It now ran, but the generated poems looked roughly like this:

橬咑梕兂门鱦伎,庈吜遯关浐尘有。
仱比怜敡扲与眲,无南笔蜡鰢恨否。

At first I thought this was because of too little training, but after thinking it over it felt like a character encoding problem; Python 3 handles encodings much better than Python 2, which is why this issue only shows up here.

The original poem looks like this:
[贞条障曲砌,翠叶贯寒霜。拂牖分龙影,临池待凤翔。]
The corresponding byte string is:
'[\xe8\xb4\x9e\xe6\x9d\xa1\xe9\x9a\x9c\xe6\x9b\xb2\xe7\xa0\x8c\xef\xbc\x8c\xe7\xbf\xa0\xe5\x8f\xb6\xe8\xb4\xaf\xe5\xaf\x92\xe9\x9c\x9c\xe3\x80\x82\xe6\x8b\x82\xe7\x89\x96\xe5\x88\x86\xe9\xbe\x99\xe5\xbd\xb1\xef\xbc\x8c\xe4\xb8\xb4\xe6\xb1\xa0\xe5\xbe\x85\xe5\x87\xa4\xe7\xbf\x94\xe3\x80\x82]'
Here the first character, "贞", corresponds to '\xe8\xb4\x9e', not just '\xe8'.
If you iterate directly with:
for word in poem:
each word is a single byte such as '\xe8', the character's actual encoding is lost, and the generated output is garbled,
like this:
橬咑梕兂门鱦伎,庈吜遯关浐尘有。仱比怜敡扲与眲,无南笔蜡鰢恨否。
whose byte string is:
'\xe6\xa9\xac\xe5\x92\x91\xe6\xa2\x95\xe5\x85\x82\xe9\x97\xa8\xe9\xb1\xa6\xe4\xbc\x8e\xef\xbc\x8c\xe5\xba\x88\xe5\x90\x9c\xe9\x81\xaf\xe5\x85\xb3\xe6\xb5\x90\xe5\xb0\x98\xe6\x9c\x89\xe3\x80\x82\xe4\xbb\xb1\xe6\xaf\x94\xe6\x80\x9c\xe6\x95\xa1\xe6\x89\xb2\xe4\xb8\x8e\xe7\x9c\xb2\xef\xbc\x8c\xe6\x97\xa0\xe5\x8d\x97\xe7\xac\x94\xe8\x9c\xa1\xe9\xb0\xa2\xe6\x81\xa8\xe5\x90\xa6\xe3\x80\x82'
print '\xe6\xa9\xac'.decode('utf8')
prints: 橬
So you have to iterate with
for word in poem.decode('utf8'):
which yields word values like u'\u6761',
i.e. complete characters.
In dataPretreatment.py there are two places that need the same change. Likewise:
- in model.py, symbols such as '。' also need to become '。'.decode('utf-8')
- in main.py, change it to characters.decode('utf-8')
- in model.py, change the line for c in characters
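
A minimal Python 2 demonstration of the difference (iterating a UTF-8 encoded str yields single bytes, while iterating a unicode object yields whole characters); the poem string here is just an example:

# -*- coding: utf-8 -*-
# Python 2 only: str holds UTF-8 bytes, unicode holds characters
poem = '[贞条障曲砌,翠叶贯寒霜。]'

for ch in poem[:3]:
    print repr(ch)                    # '[', '\xe8', '\xb4' -- raw bytes

for ch in poem.decode('utf8')[:4]:
    print repr(ch)                    # u'[', u'\u8d1e', u'\u6761', ... -- whole characters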

For input handling, the input function in main.py needs to become raw_input, which is likewise a Python 2 vs 3 difference.

Training Results

My MacBook Pro does not support GPU training, so I could only train on the CPU, at roughly one step (batch) per second. CPU usage immediately shot up to nearly 600% and the temperature to 95-100°C, though memory pressure was modest. Not wanting to keep the laptop running hot, I stopped after only 4 epochs with the loss still at 4-5; I should try training on a server or a GPU later.

The results are as follows.

Random poem:

Acrostic poem:

Summary

Since I only adjusted the encoding handling so the project runs on Python 2.7, leaving the model untouched, without any parameter tuning and with very little training, the results are frankly poor; all of this is left for future study and improvement.

Even so, for someone touching deep learning models and TensorFlow for the first time, this project is fun and approachable, and it made for a good first attempt.

2017-08-11 18:00:51.804169: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 18:00:51.804191: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 18:00:51.804196: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-08-11 18:00:51.804199: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.

Notes from a WeChat conversation with Prof. Wang; the bullet points are Prof. Wang's replies:

https://github.com/zhicongchen/Chinese_poem_generator

I saw a Tang poem generator a classmate wrote in Python 3 and tweaked it (basically just encoding issues) so it runs on 2.7; you can clone it and play with it directly. The catch is that on a Mac it can only train on the CPU, which is painfully slow, and the laptop gets really hot during training...

http://nbviewer.jupyter.org/github/zhicongchen/datalab/blob/master/Getting%20Started%20With%20TensorFlow.ipynb

What I find interesting about this poem-writing bot is that it could perhaps learn to write news.
Its principle is somewhat like word2vec; for us, the significance may be that it offers a way to build representations of data (i.e. to vector) and thereby learn features.

  • Whether it is image or text processing, the data have strong locality; other kinds of data may not have this property.
  • The tutorials we see now are all about text and images; does it feel like that is all it can do?
    There is a saying that DL shines on text and images precisely because they are well represented: text via word2vec, and images are already vectors.
  • How would we build deep learning models on mobile phone data?

What I can think of is combining it with network representation work, i.e. representing the mobile phone network.
(maybe) We could use node2vec and then use the resulting vectors as features to predict a target quantity.
Our usual approach to prediction has been to compute structural properties of the network, or at most features constructed from domain knowledge, and then predict the target.
I wonder whether what node2vec produces would be stronger; if it is, then as I understand it domain knowledge would become unnecessary, though at this stage DL probably isn't that powerful yet.

  • Try this with the Baidu Reading data: build an attention flow network, get node vectors with node2vec or deepwalk, feed them to a deep learning model, and use it to predict book sales volume and revenue.
  • In the example I just pasted, deep learning frameworks can certainly do regression analysis. I don't know whether neural network models such as CNNs can be used for this.

Right, the TensorFlow tutorial also uses a linear model as its example, much like the tutorial you pasted.

My understanding is this: the predictive model is still a linear model, only trained with a neural network instead of the usual ordinary least squares (OLS). So the neural network's role may just be to fit somewhat better parameters, while the model's predictive power itself does not really improve, since it is after all only a linear model.

This one seems to use a LinearModel + ReLU + LinearModel network structure for prediction, which looks like a different way of using DL; so the key to improving predictive performance lies in the design of the network?

Then my earlier understanding was a bit off: neural networks can indeed do prediction, depending on how the network is designed; in the first example the network simply contained nothing but a linear model. So there really are two ways of applying DL.

Thoughts afterwards:

How well does node2vec actually work? If it were already perfect, why are so many people still working on NRL (Network Representation Learning)?
Where exactly do the hard problems in NRL lie?

Try it on the mobile phone network data: just how powerful is node2vec?

AI & Public Policy, Tsinghua University & University of Chicago, July 22-23, 2017

7.22

Keynote Speech

Xue Lan

Public Policy in Tsinghua: wide range research, set up in 2000

Four major transformations in China since 1979

  • Economic system: planning->market
  • Industrial structure: Agriculture+manufacturing-> manufacturing
  • Society: rural->urban; closed->open
  • Governance system: charisma & authority -> efficiency

R&D Expenditures

  • How is the quality of Chinese publications, given that their number has grown a lot?
  • How to measure it?
  • How to explain the rise of Chinese publications in high quality journals?

use Excellence in Research for Australia journals as a criterion to select journals

  • Sociology of knowledge (知识社会学)
  • Science of Science - Individual, Citation
  • Documentation and information management (文献情报与信息管理) - Journal, Influence, Index

James Evans

Science as a complex system

  • How does science reproduce?
  • How does science evolve?
  • How does science persist?
  • How do fields ignite?

Knowledge Representation: Collocation Hypergraph/Adjacency Tensor

Through representation, everything is close to each other. It’s a small world after all.

Predicting Paper/Patents

Novel(improbable) outcomes: Novelty - 1/P(I|A, S)

Content Novelty & Context Novelty: 0 correlation

  • content: people combining knowledge, pooling concepts
  • context: the community you draw from

Science thinks like a global Bayesian. Science does not think the way an individual scientist thinks.

What’s Science’ Objective?

  • Solving the world’s problems.
  • Discovering what it discovers
  • Transforming itself
  • Generating robust, generalizable knowledge?

topic models - a mathematically identical way to represent a paper

Dashun Wang

dashunwang@gmail.com

Predictive Signals Behind Success

Using Social theories, combining mathematical methods
Just like the keynote on IC2S2, 2017

Q: Success can be measured, modeled and predicted?
the collective feature of success
You are successful because all of others think you are successful

Modeling Citation Dynamics: 3 factors

  • Preferential Attachment
  • Aging
  • Intrinsic Novelty

Combine the 3 factors to measure the probability of paper citation and it can be solved analytically
Rescaled Citation and Rescaled Time

Quantifying the evolution of individual scientific impact

Will a scientist produce higher impact work following a major discovery? Hope.

Timing of the hits is high between 0 and 20 years, decays afterwards? Actually it is random! It decays just because their publications decayed. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

Hot hand phenomenon in artistic, cultural and scientific careers. Biggest vs. Second Biggest hit

What is the diffusion of innovation about? It's not about adoption, but about substitution. What does the substitution look like? Exponential growth / logistic growth

  • Handsets, Impact: number of handsets sold; every handset has its own exponential parameter
  • Automobiles,
  • Mobile Apps, Impact: number of downloads

Power law grows much much slower than exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, every day over 10 years, ~9000 handset models

Metric: Substitution Probability, determined by 3 factors(model)

  • Preferential Return
  • Recency
  • Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Taken together

  • Impact grows as power law with non-integer exponents
  • By exploring large-scale datasets, we find three mechanisms governing substitution patterns
  • We derive Minimal Substitution model, allowing us to not only predict the observed growth pattern, but also to collapse impact trajectories into one universal curve
  • The Minimal Substitution model predicts an intriguing connection between short term impact and long term impact

To finish this work this summer, I hope.

A story about 10%
10% -> 60%, 75 years
Improvements in geological theory provided innovators with more data to uncover the fundamental mechanisms behind it.

Question:

  • Why these 3 factors?
  • Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
    It is the minimal model we can have according to our citations. After all, we can make sure that the 3 mechanisms work.
    Only by the curve-fitting technology can we find the minimal result.
    赵洪洲

Session 1

Lingfei Wu

Team Science
Small teams create problems and grow attention into future, big teams solve them and harvest. Big teams chase successful works of small teams

Sleeping Beauty Index - PNAS

Dongbo Shi

Funding and Scientific Research: National Science Fund for Distinguished Young Scholars

Yian Yin

The Nature of Repeated Failures

Data has to ‘outlive’ individual careers, NIH datasets

alpha - stiffness, use alpha to build up a model

Each failure-success is a circle

Tao Jia

School of Computer Science, Southwest University

Probing Behavior of Scientists

Quantifying patterns of research-interests evolution, Nature Human Behavior

We are what we repeatedly do. - Aristotle

Big Data -> Activities -> Features -> User Profile
Three Features:

  • Heterogeneity: topic tuple usage in an individual’s career follows a power-law distribution
  • Recency: An individual is more likely to publish on research subjects studied recently
  • Subject Proximity

Model: Scientific research is like a random walk
To what degree could these patterns be captured by a simple statistical model?

MeiJun Liu

Faculty of Education, University of Hong Kong
Age and team of great scientific discoveries in China
On-going work, only some figures presented

Session 2

YongRen Shi

Bots improve human coordination in network experiments
Amazon Mechanical Turk: Online Labor Market - Game on Network - Quantitative Data. breadboard.yale.edu

How can bots accelerate the coordination process?
Every player chooses the best color locally, but the problem was not solved

Kevin Gao

Microsoft Research, NYC, @hb123boy

Conducting human subjects experiments in the ‘virtual lab’
Computational Social Science

In 1950s, people are set in an experimental room to be tested.

Virtual Lab: Bring the lab closer to the real world, using the Internet as a lab

  • Complexity Realism
  • Duration, Participation
  • Size, Scale

TurkServer, built on Meteor web app framework: https://github.com/TurkServer/turkserver-meteor, Crowd Mapper, Andrew Mao, Winter Mason, Siddharth Suri, Duncan Watts

Intertemporal Choice, Kevin Gao, Dan Goldstein

Long-run Cooperation, a very long prisoner’s dilemma experiment, Andrew Mao…Duncan Watts

Han Zhang

PhD candidate in Princeton University, collaborated with Jennifer Pan

Identifying protests from social media data, using deep learning techniques

Training datasets: The Lu dataset, collected by Chinese lawyer Yuyu Lu, from blogspot

Hard task: texts are short and meanings are tricky

  • Text: RNN(LSTM)
  • Image: 4-layer-CNN

YuanHao Liu

Officer mobility in China: what factors influence it?

Data Source: Prof. Zhou Xueguang

Factor-based to agent-based - causal inference to sufficient condition. Fractal network.

Logic -> Structure: Rich-get-richer and hub-repulse
Not Markov Process, Not Random Walk. Efficient way to fill a space: 3/4 law

7.23

Xingyuan Yuan

RNN - LSTM

  • Nowcasting
  • Machine Translation
  • Music Composer
    No music theory, Representation, loss function, sequence2sequence

Yan Xia

Deep learning in autonomous driving, Momenta

Industrial Thinking: Technology must go first.

Only successful way in industry - Supervised way
Big Data

  • Public: ImageNet
  • Blooming of Internet
    Big Computation
  • GPUs
    Software and Infrastructure(data storage)
  • Git, AWS, Amazon Mechanical Turk(for labeling)

Faster R-CNN, arxiv.org/abs/1506.01497, Fully Convolutional Networks for Semantic Segmentation

Jiang Zhang

Physicists: Your work is so ugly! There are too many parameters in your model.

Map of Complexity Science, by Brian Castellani

Complexity - AI

Why bother with a neural network?

  • It is a good predictor
  • It can extract features automatically

Deep Learning fights poverty, Science, remote sensing data

Feature extraction

use a CNN (feature extractor) to train a model that predicts night-time light intensity (already labeled data), then use these features (the first several layers of the model) concatenated with another model to predict poverty (transfer learning)

Complex network classifier

use a neural network to classify complex networks (small-world or scale-free), i.e. network representation. Images are the easiest to encode (represent); text has also been solved by word2vec.

Deep walk algorithm - use random walk to generate sequence
What the CNN learns - 2 filters

How to recognize without links - DeepWalk can encode the link information into coordinates.

Deep Learning can be used to solve complex network problems. Can DNN become an expert in complex network?

Yizhuang You

Hyperbolic Network, Boltzmann Machine and Holographic Duality
Popularity vs. Similarity
~ renormalization - field theory
Boltzmann Machine, Hyperbolic Space

Lei Dong

Data-driven urban studies, combining computer science and economics

Former Data Scientist in Baidu, Data Science Company - QuantUrban

What can we do with mobile phone data?

  • Mobile Phone Data and Urban Dynamics, Real-time dynamics
  • Day and Night Population Distribution
  • Mapping Home-Work Connection with Machine Learning, Baidu Maps frequent-places feature (百度地图 - 常去地点), Rule-based, Label
  • Commuting Data and Visualization
  • Community Detection and City Boundary
  • Population Migration During Spring Festival
  • Spatial-temporal Behaviors and Economics
  • Mobile Internet Coverage and Poverty

Toolkits for social scientists

  • Spider - system, dashboard
  • Mobile Turk - label data

The video on IC2S2

How can Big Data help us understand human behavior, social networks, and success?

http://www.dashunwang.com/ 王大顺

Success is about the future impact of a person; in other words, the only question in our mind is whether he will go up or down in his career. In the past, this relied on our qualitative judgment. Now we can turn to massive datasets. So let's try to find a quantitative way to study the formation of success. Can success be measured, modeled and predicted, just as we did with natural phenomena?

Science of Science

  • Robert Merton: Matthew Effect, Singletons and multiples
  • Harriet Zuckerman: Scientific Elites
  • Derek de Solla Price: Invisible College, Power law, Cumulative advantage
  • Thomas Kuhn: Paradigm

Q: Can success be measured, modeled and predicted?

  • the collective feature of success
  • You are successful because all of others think you are successful

Modeling Citation Dynamics

Three generic factors

  • Preferential Attachment
  • Aging
  • Intrinsic Novelty

Combine the 3 factors to measure the probability of paper citation and it can be solved analytically(Rescaled Citation and Rescaled Time)

Quantifying the evolution of individual scientific impact

Will a scientist produce higher impact work following a major discovery? Hope.

Timing of the hits is high between 0 and 20 years, decays afterwards? Actually it is random! It decays just because their publications decayed. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

The citation dynamics of paper i follows three parameters

  • Fitness
  • Immediacy
  • Longevity

Hot hand phenomenon in artistic, cultural and scientific careers. Biggest vs. Second Biggest hit

Innovation: Substitution or Adoption

What is the diffusion of innovation about? It's not about adoption, but about substitution.
What does the substitution look like? Exponential growth / logistic growth

  • Handsets, Impact: number of handsets sold; every handset has its own exponential parameter
  • Automobiles,
  • Mobile Apps, Impact: number of downloads

Power law grows much much slower than exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, every day over 10 years, ~9000 handset models

Metric: Substitution Probability, determined by 3 factors(model) - What mechanisms are responsible for the observed non-analytic growth?

  • Preferential Return
  • Recency
  • Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Substitution Patterns(Three parameters)

  • Anticipation
  • Fitness
  • Longevity

Taken together

  • Impact grows as power law with non-integer exponents
  • By exploring large-scale datasets, we find three mechanisms governing substitution patterns
  • We derive Minimal Substitution model, allowing us to not only predict the observed growth pattern, but also to collapse impact trajectories into one universal curve
  • The Minimal Substitution model predicts an intriguing connection between short term impact and long term impact

To finish this work this summer, I hope.

A story about 10%: oil recovery went from 10% to 60% after 75 years. It was driven by improvements in geological theory, which provided innovators with more data to uncover the fundamental mechanisms behind it.

Question:

  • Why these 3 factors?
  • Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
    It is the minimal model we can have according to our citations. After all, we can make sure that the 3 mechanisms work.
    Only by the curve-fitting technology can we find the minimal result.

Reference

  • Sinatra, Wang, Deville and Barabasi, Science, 2016
  • Liu, Wang, Giles, Sinatra, Song and Wang, 2017
  • Jin, Song, Bjelland, Canright and Wang, 2017

Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.

Indeed, although we rarely perceive any of our actions to be random, from the perspective of an outside observer who is unaware of our motivations and schedule, our activity pattern can easily appear random and unpredictable.

Background

At present, the most detailed information on human mobility across a large segment of the population is collected by mobile phone carriers

We assign three entropy measures to each individual’s mobility pattern

  • the random entropy
  • the temporal uncorrelated entropy
  • the actual entropy

removed 5000 users with the highest q from our data set, which ensured that all remaining 45,000 users satisfied q < 0.8.

Data

D1: This anonymized data set represents 14 weeks of call patterns from 10 million mobile phone users (roughly April through June 2007). The data contains the routing tower location each time a user initiates or receives a call or text message. From this information, a user’s trajectory may be reconstructed.

For each user i we define the calling frequency $f_i$ as the average number of calls per hour, and the number of locations $N_i$ as the number of distinct towers visited during the three month period.
In order to improve the quality of trajectory reconstruction, we selected 50,000 users with $f_i \geq 0.5$ calls/hour and $N_i > 2$.

D2: Mobile services such as pollen and traffic forecasts rely on the approximate knowledge of the customer's location at all times. For customers voluntarily enrolled in such services, the date, time and the closest tower coordinates are recorded on a regular basis, independent of phone usage. We were provided with the anonymized records of 1,000 such users, from which we selected 100 users whose coordinates were recorded every hour over eight days.

Metrics

Determined $S_i$, $S_i^{unc}$, and $S_i^{rand}$ for each user i

True Entropy

Entropy rate: https://en.wikipedia.org/wiki/Entropy_rate

Elements of Information Theory

Lempel-Ziv Compression Algorithm: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

http://rosettacode.org/wiki/LZW_compression#Python
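
A minimal sketch of the three entropy measures listed above, assuming each user's trajectory is a list of hourly tower IDs. The actual entropy uses a Lempel-Ziv style estimator of the kind the paper relies on; this is my own illustration, not the authors' code, and it is only suitable for short trajectories.

import math
from collections import Counter

def random_entropy(traj):
    """S_rand = log2(N): N distinct locations, all treated as equally likely."""
    return math.log(len(set(traj)), 2)

def uncorrelated_entropy(traj):
    """S_unc: Shannon entropy of the historical visitation frequencies."""
    n = float(len(traj))
    return -sum((c / n) * math.log(c / n, 2) for c in Counter(traj).values())

def actual_entropy(traj):
    """S: Lempel-Ziv style estimate, S ~ n * log2(n) / sum(Lambda_i), where
    Lambda_i is the length of the shortest substring starting at i that is
    not contained in the preceding part of the trajectory."""
    n = len(traj)
    lambdas = []
    for i in range(n):
        k = 1
        while i + k <= n and _contains(traj[:i], traj[i:i + k]):
            k += 1
        lambdas.append(k)
    return n * math.log(n, 2) / sum(lambdas)

def _contains(history, sub):
    s = len(sub)
    return any(history[j:j + s] == sub for j in range(len(history) - s + 1))

# example: a tiny weekly routine
traj = ['home', 'home', 'work', 'home', 'cafe', 'work', 'home']
print(random_entropy(traj), uncorrelated_entropy(traj), actual_entropy(traj))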

This quantity is subject to Fano’s inequality (24, 26).
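
Fano's inequality is what turns the entropy into a predictability bound: given a user's entropy S and the number N of visited locations, the maximum predictability $\Pi_{max}$ solves $S = H(\Pi_{max}) + (1 - \Pi_{max})\log_2(N-1)$, with H the binary entropy. A small bisection sketch (again my own illustration):

import math

def binary_entropy(p):
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p, 2) - (1 - p) * math.log(1 - p, 2)

def max_predictability(S, N, tol=1e-6):
    """Solve S = H(Pi) + (1 - Pi) * log2(N - 1) for Pi by bisection.
    The right-hand side decreases monotonically in Pi on [1/N, 1]."""
    if N <= 1:
        return 1.0
    lo, hi = 1.0 / N, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        rhs = binary_entropy(mid) + (1 - mid) * math.log(N - 1, 2)
        if rhs > S:
            lo = mid   # bound still above the measured entropy: Pi can be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. an actual entropy of 0.8 bits over 50 locations gives Pi_max of roughly 0.92
print(max_predictability(0.8, 50))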

Regularity

We measured each user’s regularity, R, defined as the probability of finding the user in his most visited location during that hour.

Results

For a user with $\Pi_{max} = 0.2$, this means that at least 80% of the time the individual chooses his location in a manner that appears to be random, and only in the remaining 20% of the time can we hope to predict his or her whereabouts.

To our surprise, we found that $P(\Pi_{max})$ does not follow the fat-tailed distribution suggested by the travel distances, but it is narrowly peaked near $\Pi_{max} \approx 0.93$ (Fig. 2B).

To reconcile the wide variability in the observed travel distances, we measured the dependency of $\Pi_{max}$ on $r_g$,

To determine how much of our predictability is really rooted in the visitation patterns of the top locations, we calculated the probability $\tilde{\Pi}$ that, at a given moment, the user is in one of the top n most visited locations, where n = 2 typically captures home and work.

Conclusion

It is not unreasonable to expect, therefore, that predictability should also vary widely: For people who travel little, it should be easier to foresee their location, whereas those who regularly cover hundreds of kilometers should have a low predictability. Despite this inherent population heterogeneity, the maximal predictability varies very little—indeed $P(\Pi_{max})$ is narrowly peaked at 93%, and we see no users whose predictability would be under 80%.
Although making explicit predictions on user whereabouts is beyond our goals here, appropriate data-mining algorithms (19, 20, 27) could turn the predictability identified in our study into actual mobility predictions.

With the help of Python, we extracted the list of locations a user visited in one month, weighted by visit count, as shown below:

{"lat":39.985409,"lng":116.307736,"count":3},
{"lat":39.971303,"lng":116.202289,"count":1},
{"lat":39.965816,"lng":116.267651,"count":35},
{"lat":39.957222,"lng":116.272944,"count":36},
{"lat":39.984621,"lng":116.297391,"count":177},
{"lat":39.970397,"lng":116.269086,"count":45},
{"lat":40.012951,"lng":116.312541,"count":2},
{"lat":39.996136,"lng":116.308325,"count":2},
{"lat":39.989713,"lng":116.308222,"count":2286},
{"lat":39.983464,"lng":116.308398,"count":24},
{"lat":40.005941,"lng":116.313451,"count":1212},
{"lat":39.983056,"lng":116.309167,"count":30},
{"lat":39.981186,"lng":116.302025,"count":5},
{"lat":39.983601,"lng":116.304201,"count":8},
{"lat":39.978361,"lng":116.307881,"count":2},
{"lat":39.960781,"lng":116.293075,"count":41},
{"lat":39.980808,"lng":116.309638,"count":13},
{"lat":39.966862,"lng":116.315672,"count":84},
{"lat":39.973736,"lng":116.312238,"count":31},
{"lat":39.942301,"lng":116.211671,"count":172},
{"lat":39.986081,"lng":116.304411,"count":25},
{"lat":39.959722,"lng":116.201111,"count":23},
{"lat":40.020209,"lng":116.302722,"count":14},
{"lat":39.991325,"lng":116.202392,"count":50},
{"lat":40.026282,"lng":116.298005,"count":69},
{"lat":39.991051,"lng":116.213961,"count":2},
{"lat":39.981536,"lng":116.198601,"count":8},
{"lat":39.939242,"lng":116.237355,"count":24},
{"lat":39.932371,"lng":116.270851,"count":133},
{"lat":39.924745,"lng":116.247945,"count":2},
{"lat":39.919599,"lng":116.259001,"count":309}

Create a new HTML file based on the template below, substituting in the data above.

http://developer.baidu.com/map/jsdemo.htm#c1_15

Specifically, the user key (ak) should be your own key registered with BaiduMap.

<script type="text/javascript" src="http://api.map.baidu.com/api?v=2.0&ak=USERKEY"></script>

Open the html file in Chrome, and you can get:

Reference

https://zhuanlan.zhihu.com/p/25845538

Luo, S. et al. Inferring personal economic status from social
network location. Nat. Commun. 8, 15227 doi: 10.1038/ncomms15227 (2017).

Data

The social network is constructed from mobile (calls and SMS metadata) and residential communications data, in other words CDRs (Call Detail Records), collected over a period of 122 days from a Latin American country.

The financial dataset from a major bank in the same country was collected during the same time period as the mobile dataset. The dataset consists of records of the bank clients’ age, gender, credit score, total transaction amount during each billing period, credit limit of each credit card, balance of cards (including debit and credit), zip code of billing address, and encrypted registered phone number.

Metrics

Collective Influence (CI) is an algorithm to identify the most influential nodes via optimal percolation.

CI minimizes the largest eigenvalue of a modified non-backtracking matrix of the network in order to find the minimal set of nodes to disintegrate the network.

CI has advantages in resolution, correlation with wealth, and scalability to massively large social networks.

CI is a concept proposed by

Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature 524, 65–68 (2015).

Results

Communication Patterns v.s. Economic Status

It is visually apparent that the top 1% (accounting for 45.2% of the total credit in the country) displays a completely different pattern of communication than the bottom 10%; the former is characterized by more active and diverse links, especially connecting remote locations and communicating with other equally affluent people.

The wealthiest 1-percenters have higher diversity in mobile contacts and are centrally located, surrounded by other highly connected people (network hubs). On the other hand, the poorest individuals have low contact diversity and are weakly connected to fewer hubs.

Fraction of Wealthy Individuals v.s. Age and Network Metrics

Correlation between the fraction of wealthy individuals versus age and (a) degree k (R2 = 0.92), (b) k-shell (R2 = 0.96), (c) PageRank (R2 = 0.96) and (d) log10CI (R2 = 0.93).

Further correlations are studied in Supplementary Note 6, indicating that CI could be considered as the most convenient metric out of the four due to its high resolution.

When we combine the age and CI quantile rankings into an age-network composite, $ANC = \alpha \cdot Age + (1 - \alpha) \cdot CI$, with $\alpha = 0.5$, a remarkable correlation (R2 = 0.99, Fig. 3c) is achieved.

This afternoon, Professor Wei Ran of the University of South Carolina visited our school for an exchange. I had the privilege of interviewing him and asking about some of my confusions around doing research and writing papers. I learned a lot and am recording it here.

Academic writing in China leans toward discursive essays, whereas papers written overseas follow a strict formula: topic selection, significance, data, methods, results, and so on.

Academic research overseas emphasizes a kind of public culture. For example, serving as a journal editor usually comes with no formal appointment or even salary; it is only an honorary title, and professors volunteer their time and effort to review papers. This process is really a process of building an academic culture, and every researcher should try to understand, respect, and take part in it, and should also respect the review comments that experts give on their own time.

Are there tricks to writing and submitting papers? Certainly, but what I would rather talk about are the elements a good study should have. First, a strong sense of problem (sensitivity to research questions, their social significance and importance). Second, a solid theoretical foundation: a good research question requires the researcher to approach it from the perspective of communication theory or another relevant theoretical angle, which takes a great deal of accumulated reading. Third, scientifically rigorous methods, whether quantitative or qualitative. Finally, good writing habits. A good paper is rarely written only when one feels like writing; it is written a little every day. Researchers must cultivate the habit of writing a few hundred words daily, recording what they read, think, and feel. Only with this constant accumulation can they produce a good paper quickly when a call for papers arrives, rather than rushing one out right before the deadline.

One takeaway: elevate the habit of keeping a diary. Record not only daily trivia and everyday thoughts, but also cultivate a habit of everyday academic writing, putting thoughts about scholarly questions into scholarly language. This is also consistent with the writing practice for the TOEFL exam I am currently preparing for.

Finally, Prof. Wei introduced the Chinese Communication Association.

Among overseas communication research communities, the earliest (in the 1980s) was mainly the Koreans (Korean Association of Communication). Following the Korean model, Professor Lee Chin-Chuan (李金铨) of City University of Hong Kong founded the CCA (Chinese Communication Association). At first it was mainly a social organization (the CCA reception), but as it grew it began to consider academic services and to act as a bridge, for example holding workshops back in China and building collaborations and services between China and abroad, so as to form an active, service-oriented scholarly circle. One major principle of the CCA is not to study only Chinese research questions: being Chinese is no reason to work only on Chinese problems.

While preparing the interview outline, I read two of Prof. Wei's papers:

  • Wei, R. (2014). Texting, tweeting, and talking: Effects of smartphone use on engagement in civic discourse in China. Mobile Media & Communication, 2(1), 3-19.
  • Wei, R., Lo, V. H., Xu, X., Chen, Y. N. K., & Zhang, G. (2014). Predicting mobile news use among college students: The role of press freedom in four Asian cities. new media & society, 16(4), 637-654.

Both papers use quantitative (survey) methods to study mobile phone use. The first paper's main finding is that, compared with traditional government-controlled media, smartphone use effectively increases people's political discussion and political engagement, with talking politics in private, extensive use of the smartphone, and mobile tweeting being the three main positive predictors of online political discussion. The second paper finds that reading news on the phone and using microblog-type tools on the phone play out very differently across the four cities studied (Shanghai, Hong Kong, Taipei and Singapore), with press freedom negatively correlated with mobile news use and mobile microblog use.

A short bio of Prof. Wei Ran:

Wei Ran, whose ancestral home is Henan, is a tenured professor and doctoral advisor in the School of Journalism and Mass Communications at the University of South Carolina and chair of its advertising and public relations department. He graduated from Shanghai International Studies University in 1986, majoring in English and international journalism, and received a master's degree from the University of Wales in 1990 and a PhD from Indiana University in 1995. He has worked as a reporter for China Central Television, as an assistant professor in the School of Journalism and Communication at the Chinese University of Hong Kong, and as a senior visiting scholar in the School of Communication and Information at Nanyang Technological University in Singapore. He currently serves as associate editor of the US SSCI journal Mass Communication & Society (《大众传播与社会》), guest editor-in-chief of the Singapore SSCI journal 《亞洲传媒》, and on the editorial boards of five communication journals in the US and Asia. An internationally known expert on mobile media research, he has repeatedly won outstanding paper awards in journalism and mass communication in the US. He is a guest professor at Communication University of China and Henan University, an overseas academic review committee member at City University of Hong Kong, and an overseas assessor for the University of Hong Kong.

http://smd.sjtu.edu.cn/teacher/detail/id/23

Ran Wei, PhD, is the Gonzales Brothers Professor of Journalism in the School of Journalism & Mass Communications at the University of South Carolina, USA. A former TV journalist, active media consultant, and incoming Editor-in-Chief of Mass Communication & Society, his research focuses on media effects in society and digital new media, including wireless computing and mobile media.

https://www.sc.edu/study/colleges_schools/cic/faculty-staff/wei_ran.php

Webster, J. G., & Ksiazek, T. B. (2012). The dynamics of audience fragmentation: Public attention in an age of digital media. Journal of Communication, 62(1), 39-56.

Abstract

Audience fragmentation is often taken as evidence of social polarization. We offer a theoretical framework for understanding fragmentation and advocate for more audience-centric studies. We find extremely high levels of audience duplication across 236 media outlets, suggesting overlapping patterns of public attention rather than isolated groups of audience loyalists.

Three factors that shape fragmentation

Media Providers

The most obvious cause of fragmentation is a steady growth in the number of media outlets and products competing for public attention.

Media Users

What media users do with all those resources is another matter. Most theorists expect them to choose the media products they prefer. Those preferences might reflect user needs, moods, attitudes, or tastes, but their actions are ‘‘rational’’ in the sense that they serve those psychological predispositions.

Media Measures

Media measures exercise a powerful influence on what users ultimately consume and how providers adapt to and manage those shifting patterns of attendance. Indeed, information regimes can themselves promote or mitigate processes of audience fragmentation

Three different ways of studying fragmentation

Media-centric fragmentation

An increasingly popular way to represent media-centric data is to show them in the form of a long tail (Anderson, 2006).

Concentration can be summarized with any one of several statistics, including Herfindahl–Hirschman indices (HHIs) and Gini coefficients (see Hindman, 2009; Yim, 2003).

Herfindahl–Hirschman indices (HHIs)

The Herfindahl index (also known as Herfindahl–Hirschman Index, or HHI) is a measure of the size of firms in relation to the industry and an indicator of the amount of competition among them. It is defined as the sum of the squares of the market shares of the firms within the industry (sometimes limited to the 50 largest firms), where the market shares are expressed as fractions. The result is proportional to the average market share, weighted by market share. As such, it can range from 0 to 1.0, moving from a huge number of very small firms to a single monopolistic producer.

HHI = \sum_{i=1}^{N}(X_i/X)^2 = \sum_{i=1}^{N}S_i^2
  • $X_i$ - the size of firm i
  • $X$ - the total size of the market
  • $S_i$ - the market share of firm i
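A minimal sketch of the HHI computation implied by the formula above, with shares as fractions on a 0-1 scale; the HHIs reported below (e.g. 144.17) are presumably computed with shares expressed as percentages, which puts the index on a 0-10,000 scale.

def hhi(audience_sizes):
    """Herfindahl-Hirschman index: sum of squared shares (fractions)."""
    total = float(sum(audience_sizes))
    return sum((x / total) ** 2 for x in audience_sizes)

print(hhi([25, 25, 25, 25]))   # 0.25: four equally sized outlets
print(hhi([100]))              # 1.0: a single monopolist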

Gini Coefficients

The Gini coefficient is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, and is the most commonly used measure of inequality. A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of 1 (or 100%) expresses maximal inequality among values (e.g., for a large number of people, where only one person has all the income or consumption, and all others have none, the Gini coefficient will be very nearly one).

if $x_i$ is the wealth or income of person $i$, and there are $n$ persons, then the Gini coefficient $G$ is given by:

G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n\sum_{i=1}^{n}x_i}
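
A direct (O(n^2)) sketch of this formula:

def gini(values):
    """Gini coefficient via the mean-absolute-difference formula above."""
    n = len(values)
    diff_sum = sum(abs(xi - xj) for xi in values for xj in values)
    return diff_sum / (2.0 * n * sum(values))

print(gini([1, 1, 1, 1]))    # 0.0: perfect equality
print(gini([0, 0, 0, 10]))   # 0.75: one person holds everything (max for n=4 is 1 - 1/n)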

Results

In Figure 1, the drop-off in cable network attendance is not precipitous, producing an HHI of 144.17, which suggests a modest level of overall concentration.

the HHI for Figure 2 is 173.14, indicating that the use of Internet brands is more concentrated than the use of television channels. Typically, audiences in less abundant media, such as radio and television, are more evenly distributed across outlets (i.e., fragmented) than in media with many choices such as the Internet (Hindman, 2009; Yim, 2003)

User-centric fragmentation

This approach focuses on each individual’s use of media. It is fragmentation at the microlevel. Most of the literature on selective exposure would suggest that people will become specialized in their patterns of consumption, which is called “media repertories” or “channel repertories”. Most studies focus on explaining the absolute size of repertories, but often say little about their composition.

A user-centric approach has the potential to tell us what a typical user encounters over some period of time. They rarely ‘‘scale-up’’ to the larger issues of how the public allocates its attention across media.

Audience-centric fragmentation

A useful complement to the media- and user-centric approaches described above would be an "audience-centric" approach. This hybrid approach is media-centric in the sense that it describes the audience for particular media outlets. It is user-centric in that it reflects the varied repertoires of audience members, which are aggregated into measures that summarize each audience.

A network analytic approach to fragmentation

How to Build Network

The enlarged portion shows the link (i.e., the level of duplication) between a pair of nodes, NBC Affiliates and the Yahoo! brand, where 48.9% of the audience watched NBC and also visited a Yahoo! Web site during March 2009.

  • Node - media
  • Edge - duplication of audience(percent)

the question was how much duplication should be required to declare a link?

expected duplication v.s. observed duplication

Our approach was to compare the observed duplication between two outlets to the "expected duplication" due to chance alone. Expected duplication was determined by multiplying the reach of each outlet. So, for example, if outlet A had a reach of 30% and outlet B a reach of 20%, then 6% of the total audience would be expected to have used each just by chance. If the observed duplication exceeded the expected duplication, a link between two outlets was declared present (1); if not, it was absent (0) (see Ksiazek, 2011, for a detailed treatment of this operationalization).

In other words, we can first build an attention flow network, then prune the edges whose observed duplication is smaller than their expected duplication.

# build attention flow network
from collections import defaultdict
import networkx as nx
import numpy as np

def constructFlowNetwork(C):
    E = defaultdict(lambda: 0)
    E[('source', C[0][1])] += 1
    E[(C[-1][1], 'sink')] += 1
    F = zip(C[:-1], C[1:])
    for i in F:
        if i[0][0] == i[1][0]:
            E[(i[0][1], i[1][1])] += 1
        else:
            E[(i[0][1], 'sink')] += 1
            E[('source', i[1][1])] += 1
    G = nx.DiGraph()
    for i, j in E.items():
        x, y = i
        G.add_edge(x, y, weight=j)
    return G

# df holds the reading records; book_groups maps a book_id to the row indices
# of its readers, and user_sum is the total number of users (as a float)
d = np.array(df[['uid', 'book_id']])
G = constructFlowNetwork(d)
# copy the edge list so edges can be removed while iterating, pruning links
# whose observed duplication falls below the expected duplication
for edge in list(G.edges()):
    if 'sink' not in edge and 'source' not in edge:
        observed_duplication = len(set(df['uid'].iloc[book_groups[edge[0]]]) &
                                   set(df['uid'].iloc[book_groups[edge[1]]])) / user_sum
        expected_duplication = (len(book_groups[edge[0]]) / user_sum) * (len(book_groups[edge[1]]) / user_sum)
        if observed_duplication < expected_duplication:
            G.remove_edge(edge[0], edge[1])

Network Metrics

  • degree score: converted into percent, corresponding to nx.degree_centrality()

For each outlet, the number of links is totaled to provide a degree score. For ease of interpretation, we converted these totals to percentages. So, for example, if an outlet had links to all the other 235 outlets, its degree score was 100%. If it had links to 188 outlets, its degree score was 80%.

  • network centralization score (Freeman, 1979), summarizing the spread of the degree scores across the whole network

To provide a summary measure across the entire network of outlets, we computed a network centralization score. This score summarizes the variability or inequality in the degree scores of all nodes in a given network (Monge & Contractor, 2003) and is roughly analogous to the HHI (see Hindman, 2009; Yim, 2003) that measures concentration in media-centric research. Network centralization scores range from 0% to 100%. In this application, a high score indicates that audiences tend to gravitate to a few outlets (concentration), whereas a low score indicates that audiences spread their attention widely across outlets (fragmentation).
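
The paper's exact operationalization is not reproduced here, but a Freeman-style degree centralization can be sketched with networkx as follows (0 for a perfectly even degree distribution, 1 for a star graph); this is my own illustration:

import networkx as nx

def degree_centralization(G):
    """Freeman (1979) degree centralization: how far the degree distribution
    is from perfectly even (0) versus maximally centralized, a star (1)."""
    n = G.number_of_nodes()
    if n <= 2:
        return 0.0
    c = nx.degree_centrality(G)          # degree / (n - 1)
    c_max = max(c.values())
    return sum(c_max - ci for ci in c.values()) / (n - 2)

print(degree_centralization(nx.star_graph(10)))   # 1.0
print(degree_centralization(nx.cycle_graph(10)))  # 0.0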

Result

The distribution shows that almost all 236 outlets have high levels of audience duplication with all other outlets(i.e., degree scores close to 100%). Furthermore, the network centralization score is 0.86%. This suggests a high level of equality in degree scores and thus evidence that the audience of any given outlet, popular or not, will overlap with other outlets at a similar level.

For instance, the Internet brand Spike Digital Entertainment reaches only 0.36% of the population, but its audience overlaps with close to 70% of the other outlets. Although we do not have data on individual media repertoires, these results suggest that repertoires, though quite varied, have many elements in common. The way users move across the media environment does not seem to produce highly polarized audiences.

The future of audience fragmentation

The myth of enclaves

“Long Tail forces and technologies that are leading to an explosion of variety and abundant choice in the content we consume are also tending to lead us into tribal eddies. When mass culture breaks apart it doesn’t re-form into a different mass. Instead, it turns into millions of microcultures.” Anderson, 2006

Our results indicate that, at least across the 236 outlets we examined, there are very high levels of audience overlap. The people who use any given TV channel or Web site are disproportionately represented in the audience for most other outlets.

All-in-all, there is very little evidence that the typical user spends long periods of time in niches or enclaves of like-minded speech. Alternatively, there is also little evidence that the typical user only consumes hits. Rather, most range widely across the media landscape, a pattern confirmed by the low network centralization score. They may appear in the audience of specialized outlets, but they do not stay long.

That said, neither media-centric nor audience-centric studies on fragmentation provide much evidence of a radical dismembering of society. Although Anderson (2006) can look at long tails and foresee ‘‘the rise of massively parallel culture’’, we doubt that interpretation. That suggests a profusion of media environments that never intersect.

It is more likely that we will have a massively overlapping culture. We think this for two reasons. First, there is growing evidence that despite an abundance of choice, media content tends to be replicated across platforms (e.g., Boczkowski, 2010; Jenkins, 2006; Pew, 2010). Second, while no two people will have identical media repertoires, the chances are they will have much in common. Those points of intersection will be the most popular cultural products, assuming, of course, that popular offerings persist.

The persistence of popularity

Will future audiences distribute themselves evenly across all media choices or will popular offerings continue to dominate the marketplace?

Anderson (2006, p. 181) expects that in a world of infinite choice, ‘‘hit-driven culture’’ will give way to ‘‘ultimate fragmentation.’’ Others believe that ‘‘winner-take-all’’ markets will continue to characterize cultural consumption (e.g., Elberse, 2008; Frank & Cook, 1995).

We are inclined to agree with the latter and offer three arguments why audiences are likely to remain concentrated in the digital media marketplace:

Differential quality of media products

The quality of media products is not uniformly distributed. If prices are not prohibitive, attendance will gravitate to higher quality choices.

First, the pure ‘‘public good’’ nature of digital media makes them easy to reproduce, and often ‘‘free’’ (Anderson, 2009). As Frank and Cook (1995, p. 33) noted

If the best performers’ efforts can be cloned at low marginal cost, there is less room in the market for lower ranked talents.

Second, the increased availability of "on-demand" media promotes this phenomenon. The move to digital video recorders and downloaded or streamed content makes it simple to avoid the less desirable offerings that were often bundled in linear delivery systems. Consuming a diet of only the best the market has to offer is easier than ever before. This effectively reduces the number of choices and concentrates attention on those options. (Recommendation systems.)

The social desirability of media selections

Media have long served as a "coin-of-exchange" in social situations (Levy & Windahl, 1984). A few programs, sporting events, or clips on YouTube are the stuff of water-cooler conversations, which encourages those who want to join the discussion to see what everyone else is talking about.

The advent of social media, such as Facebook and Twitter, may well extend these conversations to virtual spaces and focus the attention of those networks on what they find noteworthy. Often this will be popular, event-driven programming.

Recent studies on simultaneous media use during the 2010 Super Bowl and opening ceremonies of the Winter Olympics suggest that individuals use social media to discuss these events as they watch TV (NielsenWire, 2010, February 12; 2010, February 19).

The media measures that inform user choices

Because digital media are abundant and the products involved are experience goods, users depend on recommendation systems to guide their consumption. Although search and recommendation algorithms vary, most direct attention to popular products or outlets (Webster, 2010).

The more salient that user information, the more markets are inclined to produce winner-take-all results, although the actual winners are impossible to predict before the process begins. Under such circumstances, the ‘‘wisdom of crowds’’ (Surowiecki, 2004) may not be a reliable measure of quality, but it concentrates public attention nonetheless.

Conclusion

The persistence of popularity, and the inclination of providers to imitate what is popular, suggests that audiences will not spin off in all directions. Although the ongoing production of media by professionals and amateurs alike will grow the long tail ever longer, that does not mean endless fragmentation. Most niche media will be doomed to obscurity and the few who pay a visit will spend little time there.

Rather, users will range widely across media outlets, devoting much of their attention to the most salient offerings. Those objects of public attention will undoubtedly be more varied than in the past. They will often, though not always, be the best of their kind. They will be the media people talk about with friends and share via social networks. Their visibility and meaning may vary across the culture, but they will constitute the stuff of a common, twenty-first-century cultural forum.