AI & Public Policy, July 22-23, 2017

Tsinghua University & University of Chicago

7.22

Keynote Speech

Xue Lan

Public Policy at Tsinghua: wide-ranging research; the school was set up in 2000

Four major transformations in China since 1979

  • Economic system: planning->market
  • Industrial structure: Agriculture+manufacturing-> manufacturing
  • Society: rural->urban; closed->open
  • Governance system: charisma & authority -> efficiency

R&D Expenditures

  • How has the quality of Chinese publications evolved as their volume has grown so much?
  • How to measure it?
  • How to explain the rise of Chinese publications in high quality journals?

Uses the Excellence in Research for Australia journal list as the criterion for selecting journals.

  • Sociology of Knowledge (知识社会学)
  • Science of Science - Individual, Citation
  • Bibliometrics and Information Management (文献情报与信息管理) - Journal, Influence, Index

James Evans

Science as a complex system

  • How does science reproduce?
  • How does science evolve?
  • How does science persist?
  • How do fields ignite?

Knowledge Representation: Collocation Hypergraph/Adjacency Tensor

In this representation, everything is close to everything else; it's a small world after all.

Predicting Paper/Patents

Novel (improbable) outcomes: Novelty ~ 1/P(I | A, S)

Content Novelty & Context Novelty: 0 correlation

  • content: people combining knowledge, pooling concepts
  • context: the community you draw from

Science as a whole thinks like a global Bayesian; it does not think the way an individual scientist thinks.

What’s Science’ Objective?

  • Solving the world’s problems.
  • Discovering what it discovers
  • Transforming itself
  • Generating robust, generalizable knowledge?

Topic models - a mathematically equivalent way to represent a paper

Dashun Wang

dashunwang@gmail.com

Predictive Signals Behind Success

Using social theories combined with mathematical methods,
much like his keynote at IC2S2 2017

Q: Can success be measured, modeled and predicted?
The collective feature of success:
you are successful because others think you are successful.

Modeling Citation Dynamics: 3 factors

  • Preferential Attachment
  • Aging
  • Intrinsic Novelty

Combining the three factors gives the probability that a paper is cited, and the model can be solved analytically
Rescaled Citation and Rescaled Time

Quantifying the evolution of individual scientific impact

Will a scientist produce higher-impact work following a major discovery? We hope so.

Does the timing of the hits peak between 0 and 20 years into a career and decay afterwards? Actually it is random! It only appears to decay because the number of publications decays. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

Hot-hand phenomenon in artistic, cultural and scientific careers. Biggest vs. second-biggest hit.

What is the diffusion of innovation? It's not about adoption, but about substitution. What does substitution look like? Exponential growth / logistic growth

  • Handsets; impact: number of handsets sold. Every handset has its own exponential parameter
  • Automobiles,
  • Mobile Apps, Impact: number of downloads

A power law grows much more slowly than an exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, observed every day over 10 years, ~9,000 handset models

Metric: substitution probability, determined by 3 factors (model)

  • Preferential Return
  • Recency
  • Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Taken together

  • Impact grows as a power law with non-integer exponents
  • By exploring large-scale datasets, we find three mechanisms governing substitution patterns
  • We derive a Minimal Substitution model, allowing us not only to predict the observed growth pattern, but also to collapse impact trajectories onto one universal curve
  • The Minimal Substitution model predicts an intriguing connection between short-term impact and long-term impact

To finish this work this summer, I hope.

A story about 10%: from 10% to 60% over 75 years, driven by improvements in geological theory; providing innovators with more data helps uncover the fundamental mechanisms behind it.

Question:

  • Why these 3 factors?
  • Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
    It is the minimal model we can build that is consistent with our citation data; at least we can be sure that the three mechanisms are at work.
    Only through curve fitting can we find this minimal result.
    (Question from Zhao Hongzhou)

Session 1

Lingfei Wu

Team Science
Small teams create new problems and their attention grows into the future; big teams solve existing problems and harvest the rewards. Big teams chase the successful works of small teams.

Sleeping Beauty Index - PNAS

Dongbo Shi

Funding and Scientific Research: National Science Fund for Distinguished Young Scholars

Yian Yin

The Nature of Repeated Failures

Data has to ‘outlive’ individual careers, NIH datasets

alpha - stiffness, use alpha to build up a model

Each failure-success is a circle

Tao Jia

School of Computer Science, Southwest University

Probing Behavior of Scientists

Quantifying patterns of research-interests evolution, Nature Human Behavior

We are what we repeatedly do. - Aristotle

Big Data -> Activities -> Features -> User Profile
Three Features:

  • Heterogeneity: topic tuple usage in an individual’s career follows a power-law distribution
  • Recency: An individual is more likely to publish on research subjects studied recently
  • Subject Proximity

Model: Scientific research is like a random walk
To what degree could these patterns be captured by a simple statistical model?

MeiJun Liu

Faculty of Education, University of Hong Kong
Age and teams behind great scientific discoveries in China
Ongoing work; only some figures were presented

Session 2

YongRen Shi

Bots improve human coordination in network experiments
Amazon Mechanical Turk: online labor market - games on networks - quantitative data. breadboard.yale.edu

How can bots accelerate the coordination process?
Every player chooses the locally best color, but the global coordination problem is not solved

Kevin Gao

Microsoft Research, NYC, @hb123boy

Conducting human subjects experiments in the ‘virtual lab’
Computational Social Science

In the 1950s, subjects were placed in an experimental room to be tested.

Virtual Lab: Bring the lab closer to the real world, using the Internet as a lab

  • Complexity Realism
  • Duration, Participation
  • Size, Scale

TurkServer, built on Meteor web app framework: https://github.com/TurkServer/turkserver-meteor, Crowd Mapper, Andrew Mao, Winter Mason, Siddharth Suri, Duncan Watts

Intertemporal Choice, Kevin Gao, Dan Goldstein

Long-run Cooperation, a very long prisoner’s dilemma experiment, Andrew Mao…Duncan Watts

Han Zhang

PhD candidate at Princeton University, collaborating with Jennifer Pan

Identifying protests from social media data, using deep learning techniques

Training datasets: The Lu dataset, collected by Chinese lawyer Yuyu Lu, from blogspot

Hard task: texts are short and meanings are tricky

  • Text: RNN(LSTM)
  • Image: 4-layer-CNN

YuanHao Liu

Officials' mobility in China: what factors influence it?

Data Source: Prof. Zhou Xueguang

From factor-based to agent-based; from causal inference to sufficient conditions. Fractal network.

Logic -> Structure: rich-get-richer and hub repulsion
Not a Markov process, not a random walk. Efficient way to fill a space: the 3/4 law

7.23

Xingyuan Yuan

RNN - LSTM

  • Nowcasting
  • Machine Translation
  • Music Composer
    No music theory, Representation, loss function, sequence2sequence

Yan Xia

Deep learning in autonomous driving, Momenta

Industrial Thinking: Technology must go first.

The only approach that has succeeded in industry so far is supervised learning
Big Data

  • Public datasets: ImageNet
  • Blooming of the Internet

Big Computation

  • GPUs

Software and Infrastructure (data storage)

  • Git, AWS, Amazon Mechanical Turk (for labeling)

Faster R-CNN, arxiv.org/abs/1506.01497; Fully Convolutional Networks for Semantic Segmentation

Jiang Zhang

Physicists: Your work is so ugly! There are too many parameters in your model.

Map of Complexity Science, by Brian Castellani

Complexity - AI

Why bother with a neural network?

  • It is a good predictor
  • It can extract features automatically

Deep Learning fights poverty, Science, remote sensing data

Feature extraction

Use a CNN as a feature extractor: train a model to predict night-time light intensity (for which labeled data already exist), then take the features (the first several layers of the model) and connect them to another model that predicts poverty (transfer learning). A sketch of this idea follows.
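
A minimal sketch of this transfer-learning recipe, assuming a torchvision backbone as a stand-in feature extractor; the tensors and the regression head below are purely illustrative, not the paper's actual pipeline:

import torch
import torch.nn as nn
from torchvision import models

# Stand-in for a CNN trained on the labeled task (night-time lights): reuse a
# pretrained backbone and keep everything except the final classifier layer.
backbone = models.resnet18(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
for p in feature_extractor.parameters():
    p.requires_grad = False              # freeze the transferred layers

# Attach a small regression head that maps the transferred features to a poverty index.
poverty_head = nn.Linear(512, 1)         # 512 = resnet18 feature dimension

def predict_poverty(images):
    feats = feature_extractor(images).flatten(1)   # (batch, 512)
    return poverty_head(feats)

# Hypothetical usage: a batch of 224x224 satellite patches.
images = torch.randn(4, 3, 224, 224)
print(predict_poverty(images).shape)     # torch.Size([4, 1])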

Complex network classifier

Use a neural network to classify complex networks (small-world vs. scale-free); the challenge is network representation. Images are the easiest to encode (represent), and text has been solved by word2vec.

DeepWalk algorithm - use random walks to generate sequences
What the CNN learns - 2 filters

How to recognize a network without seeing the links - DeepWalk can encode the link information into coordinates (see the sketch below).
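
A rough sketch of the DeepWalk idea (random walks turned into "sentences" for word2vec), assuming networkx and gensim; all parameters are illustrative:

import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(G, num_walks=10, walk_length=20):
    """Generate truncated random-walk 'sentences' over the nodes of G."""
    walks = []
    nodes = list(G.nodes())
    for _ in range(num_walks):
        random.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(G.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(random.choice(neighbors))
            walks.append([str(n) for n in walk])   # word2vec expects string tokens
    return walks

G = nx.watts_strogatz_graph(200, 6, 0.1)           # a small-world toy network
walks = random_walks(G)
# 'vector_size' is called 'size' in older gensim versions
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, workers=2)
print(model.wv[str(0)][:5])                        # node "coordinates" encoding link structure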

Deep learning can be used to solve complex network problems. Can a DNN become an expert in complex networks?

Yizhuang You

Hyperbolic Network, Boltzmann Machine and Holographic Duality
Popularity vs. Similarity
~ renormalization - field theory
Boltzmann Machine, Hyperbolic Space

Lei Dong

Data-driven urban studies, combining computer science and economics

Former data scientist at Baidu; data science company - QuantUrban

What can we do with mobile phone data?

  • Mobile Phone Data and Urban Dynamics, Real-time dynamics
  • Day and Night Population Distribution
  • Mapping Home-Work Connections with Machine Learning; Baidu Maps' "frequently visited places" feature; rule-based labels
  • Commuting Data and Visualization
  • Community Detection and City Boundary
  • Population Migration During Spring Festival
  • Spatial-temporal Behaviors and Economics
  • Mobile Internet Coverage and Poverty

Toolkits for social scientists

  • Spider - system, dashboard
  • Mobile Turk - label data

The video on IC2S2

How can Big Data help us understand human behavior, social networks, and success?

http://www.dashunwang.com/ 王大顺

Success is about the future impact of a person; in other words, the only question in our mind is whether he will go up or down in his career. In the past we relied on qualitative judgment. Now we can turn to massive datasets. So let's try to find a quantitative way to study the formation of success. Can success be measured, modeled and predicted, just as we have done with natural phenomena?

Science of Science

  • Robert Merton: Matthew Effect, Singletons and multiples
  • Harriet Zuckerman: Scientific Elites
  • Derek de Solla Price: Invisible College, Power law, Cumulative advantage
  • Thomas Kuhn: Paradigm

Q: Can success be measured, modeled and predicted?

  • The collective feature of success
  • You are successful because others think you are successful

Modeling Citation Dynamics

Three generic factors

  • Preferential Attachment
  • Aging
  • Intrinsic Novelty

Combining the three factors gives the probability that a paper is cited, and the model can be solved analytically (rescaled citation and rescaled time).

Quantifying the evolution of individual scientific impact

Will a scientist produce higher-impact work following a major discovery? We hope so.

Does the timing of the hits peak between 0 and 20 years into a career and decay afterwards? Actually it is random! It only appears to decay because the number of publications decays. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

The citation dynamics of a paper $i$ follow three parameters (see the sketch after this list)

  • Fitness
  • Immediacy
  • Longevity
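
A small illustration of the analytic solution referred to above, assuming the closed form published for this model (Wang, Song & Barabási, Science 2013), c(t) = m(e^{λΦ((ln t − μ)/σ)} − 1), where λ acts as fitness, μ as immediacy and σ as longevity; the parameter values below are made up:

import numpy as np
from scipy.stats import norm

def cumulative_citations(t, lam, mu, sigma, m=30.0):
    """Closed-form citation curve combining fitness (lam), immediacy (mu), longevity (sigma).

    c(t) = m * (exp(lam * Phi((ln t - mu) / sigma)) - 1), with Phi the standard normal CDF.
    """
    t = np.asarray(t, dtype=float)
    return m * (np.exp(lam * norm.cdf((np.log(t) - mu) / sigma)) - 1.0)

# Two hypothetical papers: same immediacy/longevity, different fitness.
t = np.linspace(0.1, 20, 200)                       # years after publication
low = cumulative_citations(t, 1.0, 1.5, 1.0)
high = cumulative_citations(t, 3.0, 1.5, 1.0)
print(high[-1] / low[-1])                           # higher fitness -> disproportionately more citations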

Hot-hand phenomenon in artistic, cultural and scientific careers. Biggest vs. second-biggest hit.

Innovation: Substitution or Adoption

What is the diffusion of innovation? It's not about adoption, but about substitution.
What does substitution look like? Exponential growth / logistic growth

  • Handsets; impact: number of handsets sold. Every handset has its own exponential parameter
  • Automobiles,
  • Mobile Apps, Impact: number of downloads

A power law grows much more slowly than an exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, observed every day over 10 years, ~9,000 handset models

Metric: substitution probability, determined by 3 factors (model). What mechanisms are responsible for the observed non-analytic growth?

  • Preferential Return
  • Recency
  • Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Substitution Patterns(Three parameters)

  • Anticipation
  • Fitness
  • Longevity

Taken together

  • Impact grows as a power law with non-integer exponents
  • By exploring large-scale datasets, we find three mechanisms governing substitution patterns
  • We derive a Minimal Substitution model, allowing us not only to predict the observed growth pattern, but also to collapse impact trajectories onto one universal curve
  • The Minimal Substitution model predicts an intriguing connection between short-term impact and long-term impact

To finish this work this summer, I hope.

A story about 10%: oil discovery went from 10% to 60% after 75 years, driven by improvements in geological theory; providing innovators with more data helps uncover the fundamental mechanisms behind it.

Question:

  • Why these 3 factors?
  • Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
    It is the minimal model we can build that is consistent with our citation data; at least we can be sure that the three mechanisms are at work.
    Only through curve fitting can we find this minimal result.

Reference

  • Sinatra, Wang, Deville and Barabasi, Science, 2016
  • Liu, Wang, Giles, Sinatra, Song and Wang, 2017
  • Jin, Song, Bjelland, Canright and Wang, 2017

Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.

Indeed, although we rarely perceive any of our actions to be random, from the perspective of an outside observer who is unaware of our motivations and schedule, our activity pattern can easily appear random and unpredictable.

Background

At present, the most detailed information on human mobility across a large segment of the population is collected by mobile phone carriers

We assign three entropy measures to each individual’s mobility pattern

  • the random entropy
  • the temporal uncorrelated entropy
  • the actual entropy

We removed the 5,000 users with the highest q from our data set, which ensured that all remaining 45,000 users satisfied q < 0.8.

Data

D1: This anonymized data set represents 14 weeks of call patterns from 10 million mobile phone users (roughly April through June 2007). The data contains the routing tower location each time a user initiates or receives a call or text message. From this information, a user’s trajectory may be reconstructed.

For each user $i$ we define the calling frequency $f_i$ as the average number of calls per hour, and the number of locations $N_i$ as the number of distinct towers visited during the three-month period.
In order to improve the quality of trajectory reconstruction, we selected 50,000 users with $f_i \geq 0.5$ calls/hour and $N_i > 2$.

D2: Mobile services such as pollen and traffic forecasts rely on the approximate knowledge of the customer's location at all times. For customers voluntarily enrolled in such services, the date, time and the closest tower coordinates are recorded on a regular basis, independent of phone usage. We were provided with the anonymized records of 1,000 such users, from which we selected 100 users whose coordinates were recorded every hour over eight days.

Metrics

Determined $S_i$, $S_i^{unc}$, and $S_i^{rand}$ for each user $i$

True Entropy

Entropy rate: https://en.wikipedia.org/wiki/Entropy_rate

Elements of Information Theory

Lempel-Ziv Compression Algorithm: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

http://rosettacode.org/wiki/LZW_compression#Python

This quantity is subject to Fano’s inequality (24, 26).
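
A rough sketch of the three entropy measures and the Fano bound, assuming a user's trajectory is just a list of hourly location IDs; the Lempel-Ziv-style estimator follows Kontoyiannis et al., and $\Pi_{max}$ is obtained by numerically inverting $S = H(\Pi) + (1 - \Pi)\log_2(N - 1)$:

import numpy as np
from collections import Counter
from scipy.optimize import brentq

def entropies(traj):
    """Random, temporal-uncorrelated and (LZ-estimated) actual entropy of a trajectory."""
    n = len(traj)
    counts = np.array(list(Counter(traj).values()), dtype=float)
    p = counts / counts.sum()
    s_rand = np.log2(len(counts))
    s_unc = -(p * np.log2(p)).sum()
    # Lempel-Ziv-style estimator: lambda_i = length of the shortest substring starting
    # at i that has not appeared before position i.
    lambdas = []
    for i in range(n):
        k = 1
        while i + k <= n and any(traj[j:j + k] == traj[i:i + k] for j in range(i)):
            k += 1
        lambdas.append(k)
    s_actual = n * np.log2(n) / sum(lambdas)
    return s_rand, s_unc, s_actual

def max_predictability(S, N):
    """Solve Fano's inequality S = H(Pi) + (1 - Pi) * log2(N - 1) for Pi_max."""
    S = min(S, np.log2(N) - 1e-9)        # guard against estimator noise
    def H(p):
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    f = lambda p: H(p) + (1 - p) * np.log2(N - 1) - S
    return brentq(f, 1.0 / N, 1 - 1e-9)

traj = list(np.random.choice(['home', 'work', 'gym', 'cafe'], size=500, p=[0.5, 0.3, 0.1, 0.1]))
s_rand, s_unc, s_act = entropies(traj)
print(s_rand, s_unc, s_act, max_predictability(s_act, len(set(traj))))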

Regularity

We measured each user’s regularity, R, defined as the probability of finding the user in his most visited location during that hour.
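
A minimal sketch of R under that definition, assuming observations arrive as (hour-of-week, location) pairs:

from collections import Counter, defaultdict

def regularity(hourly_obs):
    """R = probability of finding the user at their most visited location for that hour.

    hourly_obs: iterable of (hour_of_week, location) pairs, hour_of_week in 0..167.
    """
    by_hour = defaultdict(list)
    for hour, loc in hourly_obs:
        by_hour[hour].append(loc)
    hits = total = 0
    for hour, locs in by_hour.items():
        hits += Counter(locs).most_common(1)[0][1]   # observations at the modal location
        total += len(locs)
    return hits / float(total)

obs = [(h % 168, 'home' if h % 24 < 8 else 'work') for h in range(24 * 28)]   # toy month
print(regularity(obs))    # 1.0 for this perfectly regular toy user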

Results

For a user with $\Pi_{max} = 0.2$, this means that at least 80% of the time the individual chooses his location in a manner that appears to be random, and only in the remaining 20% of the time can we hope to predict his or her whereabouts.

To our surprise, we found that $P(\Pi_{max})$ does not follow the fat-tailed distribution suggested by the travel distances, but it is narrowly peaked near $\Pi_{max} \approx 0.93$ (Fig. 2B).

To reconcile the wide variability in the observed travel distances, we measured the dependency of $\Pi_{max}$ on $r_g$,

To determine how much of our predictability is really rooted in the visitation patterns of the top locations, we calculated the probability $\tilde{\Pi}$ that, at a given moment, the user is in one of the top n most visited locations, where n = 2 typically captures home and work.

Conclusion

It is not unreasonable to expect, therefore, that predictability should also vary widely: For people who travel little, it should be easier to foresee their location, whereas those who regularly cover hundreds of kilometers should have a low predictability. Despite this inherent population heterogeneity, the maximal predictability varies very little—indeed $P(\Pi_{max})$ is narrowly peaked at 93%, and we see no users whose predictability would be under 80%.
Although making explicit predictions on user whereabouts is beyond our goals here, appropriate data-mining algorithms (19, 20, 27) could turn the predictability identified in our study into actual mobility predictions.

With the help of Python, we extracted the list of a user's locations over one month, each weighted by its visit count, as shown below (a possible extraction sketch follows the data):

{"lat":39.985409,"lng":116.307736,"count":3},
{"lat":39.971303,"lng":116.202289,"count":1},
{"lat":39.965816,"lng":116.267651,"count":35},
{"lat":39.957222,"lng":116.272944,"count":36},
{"lat":39.984621,"lng":116.297391,"count":177},
{"lat":39.970397,"lng":116.269086,"count":45},
{"lat":40.012951,"lng":116.312541,"count":2},
{"lat":39.996136,"lng":116.308325,"count":2},
{"lat":39.989713,"lng":116.308222,"count":2286},
{"lat":39.983464,"lng":116.308398,"count":24},
{"lat":40.005941,"lng":116.313451,"count":1212},
{"lat":39.983056,"lng":116.309167,"count":30},
{"lat":39.981186,"lng":116.302025,"count":5},
{"lat":39.983601,"lng":116.304201,"count":8},
{"lat":39.978361,"lng":116.307881,"count":2},
{"lat":39.960781,"lng":116.293075,"count":41},
{"lat":39.980808,"lng":116.309638,"count":13},
{"lat":39.966862,"lng":116.315672,"count":84},
{"lat":39.973736,"lng":116.312238,"count":31},
{"lat":39.942301,"lng":116.211671,"count":172},
{"lat":39.986081,"lng":116.304411,"count":25},
{"lat":39.959722,"lng":116.201111,"count":23},
{"lat":40.020209,"lng":116.302722,"count":14},
{"lat":39.991325,"lng":116.202392,"count":50},
{"lat":40.026282,"lng":116.298005,"count":69},
{"lat":39.991051,"lng":116.213961,"count":2},
{"lat":39.981536,"lng":116.198601,"count":8},
{"lat":39.939242,"lng":116.237355,"count":24},
{"lat":39.932371,"lng":116.270851,"count":133},
{"lat":39.924745,"lng":116.247945,"count":2},
{"lat":39.919599,"lng":116.259001,"count":309}

Create a new HTML file based on the template below, substituting in the data.

http://developer.baidu.com/map/jsdemo.htm#c1_15

Specifically, the user's key should be your own registered key from BaiduMap.

<script type="text/javascript" src="http://api.map.baidu.com/api?v=2.0&ak=USERKEY"></script>

Open the HTML file in Chrome to see the rendered heatmap.

Reference

https://zhuanlan.zhihu.com/p/25845538

Luo, S. et al. Inferring personal economic status from social
network location. Nat. Commun. 8, 15227 doi: 10.1038/ncomms15227 (2017).

Data

The social network is constructed from mobile (calls and SMS metadata) and residential communications data, in other words CDRs (Call Detail Records), collected for a period of 122 days from a Latin American country.

The financial dataset from a major bank in the same country was collected during the same time period as the mobile dataset. The dataset consists of records of the bank clients’ age, gender, credit score, total transaction amount during each billing period, credit limit of each credit card, balance of cards (including debit and credit), zip code of billing address, and encrypted registered phone number.

Metrics

Collective Influence (CI) is an algorithm to identify the most influential nodes via optimal percolation.

CI minimizes the largest eigenvalue of a modified non-backtracking matrix of the network in order to find the minimal set of nodes to disintegrate the network.

CI has advantages in resolution, correlation with wealth, and scalability to massively large social networks.

CI is a concept proposed by

Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature 524, 65–68 (2015).
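
A naive sketch of the CI score at radius ℓ, following the definition CI_ℓ(i) = (k_i − 1) Σ_{j ∈ ∂Ball(i,ℓ)} (k_j − 1) from Morone & Makse; the full algorithm's adaptive removal of top-CI nodes is omitted:

import networkx as nx

def collective_influence(G, ell=2):
    """CI_l(i) = (k_i - 1) * sum of (k_j - 1) over nodes exactly l steps away from i."""
    ci = {}
    for i in G.nodes():
        lengths = nx.single_source_shortest_path_length(G, i, cutoff=ell)
        frontier = [j for j, d in lengths.items() if d == ell]   # boundary of the ball
        ci[i] = (G.degree(i) - 1) * sum(G.degree(j) - 1 for j in frontier)
    return ci

G = nx.barabasi_albert_graph(1000, 3)
ci = collective_influence(G, ell=2)
print(sorted(ci, key=ci.get, reverse=True)[:5])   # candidate influencers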

Results

Communication Patterns vs. Economic Status

It is visually apparent that the top 1% (accounting for 45.2% of the total credit in the country) displays a completely different pattern of communication than the bottom 10%; the former is characterized by more active and diverse links, especially connecting remote locations and communicating with other equally affluent people.

The wealthiest 1-percenters have higher diversity in mobile contacts and are centrally located, surrounded by other highly connected people (network hubs). On the other hand, the poorest individuals have low contact diversity and are weakly connected to fewer hubs.

Fraction of Wealthy Individuals vs. Age and Network Metrics

Correlation between the fraction of wealthy individuals versus age and (a) degree k (R2 = 0.92), (b) k-shell (R2 = 0.96), (c) PageRank (R2 = 0.96) and (d) log10CI (R2 = 0.93).

Further correlations are studied in Supplementary Note 6, indicating that CI could be considered as the most convenient metric out of the four due to its high resolution.

When we combine age and CI quantile ranking into an age-network composite, $ANC = \alpha \cdot Age + (1 - \alpha) \cdot CI$, with $\alpha = 0.5$, a remarkable correlation (R2 = 0.99, Fig. 3c) is achieved.

This afternoon, Professor Ran Wei of the University of South Carolina visited our school for an academic exchange. I had the privilege of interviewing him and asking about some of my puzzles about doing research and writing papers. It was very illuminating, so I am recording it here.

Paper writing in China tends to be discursive, whereas papers written overseas follow a strict formula: topic, significance, data, methods, results, and so on.

Academic research overseas emphasizes a kind of public culture. For example, journal editors often have no formal appointment or even salary; it is an honorary title, and professors volunteer their time and energy to review papers. This process is in fact the building of an academic culture, one that every researcher should strive to understand, respect, and take part in, and the review comments that experts provide on their own time deserve the same respect.

Are there tricks to writing and submitting papers? Certainly, but what I would rather talk about are the elements a good study should have. First, a strong sense of the research question (sensitivity to its social significance and importance). Second, a solid theoretical foundation: a good research question requires the researcher to approach it from communication theory or other relevant theoretical perspectives, which takes long accumulation. Third, scientifically rigorous methods (whether quantitative or qualitative). Finally, good writing habits. A good paper is rarely written only when one suddenly decides to write it; it is written a little every day. A researcher must build the habit of writing a few hundred words daily, recording what they read, think, and feel. Only through constant accumulation can one, when a call for papers arrives, produce a good paper quickly and well, rather than rushing one out right before the deadline.

One takeaway: elevate the habit of keeping a diary. Beyond recording daily trivia and everyday thoughts, cultivate a daily habit of academic writing, putting thoughts on scholarly questions into scholarly language. This is also consistent with the writing section of the TOEFL exam I am currently preparing for.

Finally, the professor introduced the Chinese Communication Association.

Among overseas communication researchers, the earliest organized group (in the 1980s) was mainly Korean scholars (the Korean Association of Communication). Professor Chin-Chuan Lee of City University of Hong Kong followed the Korean model and founded the CCA (Chinese Communication Association). At first it was mainly a social organization built around the CCA reception; as it grew, it began to offer academic services and to play a bridging role, for example running workshops back in China and building collaboration and services between scholars at home and abroad, sustaining an active, service-oriented academic community. One major principle of the CCA is not to study only Chinese research questions; being Chinese is no reason to work only on China.

While preparing the interview outline, I read two of Professor Wei's papers:

  • Wei, R. (2014). Texting, tweeting, and talking: Effects of smartphone use on engagement in civic discourse in China. Mobile Media & Communication, 2(1), 3-19.
  • Wei, R., Lo, V. H., Xu, X., Chen, Y. N. K., & Zhang, G. (2014). Predicting mobile news use among college students: The role of press freedom in four Asian cities. new media & society, 16(4), 637-654.

Both papers use quantitative (survey) methods to study mobile phone use. The first paper's main finding is that, compared with traditional government-controlled media, smartphone use effectively increases people's political discussion and political participation; talking politics in private, extensive use of the smartphone, and mobile tweeting are the three main positive predictors of online political discussion. The second paper finds that reading news on the phone and using microblog-type tools on the phone differ greatly across the four studied cities (Shanghai, Hong Kong, Taipei and Singapore), and that press freedom is negatively correlated with mobile news use and mobile microblogging.

About Professor Ran Wei:

Ran Wei, a native of Henan, is a tenured professor and doctoral advisor at the School of Journalism and Mass Communications of the University of South Carolina, where he heads the advertising and public relations area. He graduated from Shanghai International Studies University in 1986, majoring in English and international journalism, and received a master's degree from the University of Wales in 1990 and a PhD from Indiana University in 1995. He has worked as a reporter for China Central Television, as an assistant professor in the School of Journalism and Communication at the Chinese University of Hong Kong, and as a senior visiting scholar at the School of Communication and Information at Nanyang Technological University in Singapore. He is currently associate editor of Mass Communication and Society (an SSCI journal), special guest editor of the Singapore-based SSCI journal 《亞洲传媒》, and a member of the editorial boards of five communication journals in the United States and Asia. An internationally known expert on mobile media research, he has received multiple outstanding paper awards in journalism and mass communication. He is also a guest professor at the Communication University of China and Henan University, an overseas academic assessor for City University of Hong Kong, and an overseas reviewer for the University of Hong Kong.

http://smd.sjtu.edu.cn/teacher/detail/id/23

Ran Wei, PhD, is the Gonzales Brothers Professor of Journalism in the School of Journalism & Mass Communications at the University of South Carolina, USA. A former TV journalist, active media consultant, and incoming Editor-in-Chief of Mass Communication & Society, his research focuses on media effects in society and digital new media, including wireless computing and mobile media.

https://www.sc.edu/study/colleges_schools/cic/faculty-staff/wei_ran.php

Webster, J. G., & Ksiazek, T. B. (2012). The dynamics of audience fragmentation: Public attention in an age of digital media. Journal of communication, 62(1), 39-56. click here

Abstract

Audience fragmentation is often taken as evidence of social polarization. We offer a theoretical framework for understanding fragmentation and advocate for more audience-centric studies. We find extremely high levels of audience duplication across 236 media outlets, suggesting overlapping patterns of public attention rather than isolated groups of audience loyalists.

Three factors that shape fragmentation

Media Providers

The most obvious cause of fragmentation is a steady growth in the number of media outlets and products competing for public attention.

Media Users

What media users do with all those resources is another matter. Most theorists expect them to choose the media products they prefer. Those preferences might reflect user needs, moods, attitudes, or tastes, but their actions are ‘‘rational’’ in the sense that they serve those psychological predispositions.

Media Measures

Media measures exercise a powerful influence on what users ultimately consume and how providers adapt to and manage those shifting patterns of attendance. Indeed, information regimes can themselves promote or mitigate processes of audience fragmentation.

Three different ways of studying fragmentation

Media-centric fragmentation

An increasingly popular way to represent media-centric data is to show them in the form of a long tail (Anderson, 2006).

Concentration can be summarized with any one of several statistics, including Herfindahl–Hirschman indices (HHIs) and Gini coefficients (see Hindman, 2009; Yim, 2003).

Herfindahl–Hirschman indices (HHIs)

The Herfindahl index (also known as Herfindahl–Hirschman Index, or HHI) is a measure of the size of firms in relation to the industry and an indicator of the amount of competition among them. It is defined as the sum of the squares of the market shares of the firms within the industry (sometimes limited to the 50 largest firms), where the market shares are expressed as fractions. The result is proportional to the average market share, weighted by market share. As such, it can range from 0 to 1.0, moving from a huge number of very small firms to a single monopolistic producer.

HHI = \sum_{i=1}^{N}(X_i/X)^2 = \sum_{i=1}^{N}S_i^2
  • $X_i$ - the size of firm $i$
  • $X$ - the total market size
  • $S_i$ - the market share of firm $i$
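
A quick sanity check of the formula in Python, with shares given as fractions summing to 1:

def hhi(shares):
    """Herfindahl-Hirschman index: sum of squared market shares."""
    return sum(s ** 2 for s in shares)

print(hhi([0.25] * 4))    # 0.25 -- four equal firms
print(hhi([1.0]))         # 1.0  -- a single monopolist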

Gini Coefficients

The Gini coefficient is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, and is the most commonly used measure of inequality. A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of 1 (or 100%) expresses maximal inequality among values (e.g., for a large number of people, where only one person has all the income or consumption, and all others have none, the Gini coefficient will be very nearly one).

If $x_i$ is the wealth or income of person $i$, and there are $n$ persons, then the Gini coefficient $G$ is given by:

G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i-x_j|}{2n\sum_{i=1}^{n}x_i}
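
And a minimal O(n²) sketch of the Gini coefficient computed directly from the pairwise-difference formula above:

def gini(x):
    """Gini coefficient via the mean absolute pairwise difference."""
    n = len(x)
    total = float(sum(x))
    diff_sum = sum(abs(xi - xj) for xi in x for xj in x)
    return diff_sum / (2.0 * n * total)

print(gini([1, 1, 1, 1]))      # 0.0  -- perfect equality
print(gini([0, 0, 0, 100]))    # 0.75 -- approaches 1 as n grows with one person holding everything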

Results

In Figure 1, the drop-off in cable network attendance is not precipitous, producing an HHI of 144.17, which suggests a modest level of overall concentration.

The HHI for Figure 2 is 173.14, indicating that the use of Internet brands is more concentrated than the use of television channels. Typically, audiences in less abundant media, such as radio and television, are more evenly distributed across outlets (i.e., fragmented) than in media with many choices such as the Internet (Hindman, 2009; Yim, 2003).

User-centric fragmentation

This approach focuses on each individual's use of media. It is fragmentation at the microlevel. Most of the literature on selective exposure would suggest that people will become specialized in their patterns of consumption, producing what are called "media repertoires" or "channel repertoires". Most studies focus on explaining the absolute size of repertoires, but often say little about their composition.

A user-centric approach has the potential to tell us what a typical user encounters over some period of time, but such studies rarely "scale up" to the larger issue of how the public allocates its attention across media.

Audience-centric fragmentation

A useful complement to the media- and user-centric approaches described above would be an "audience-centric" approach. This hybrid approach is media-centric in the sense that it describes the audience for particular media outlets. It is user-centric in that it reflects the varied repertoires of audience members, which are aggregated into measures that summarize each audience.

A network analytic approach to fragmentation

How to Build Network

The enlarged portion shows the link (i.e., the level of duplication) between a pair of nodes, NBC Affiliates and the Yahoo! brand, where 48.9% of the audience watched NBC and also visited a Yahoo! Web site during March 2009.

  • Node - media
  • Edge - duplication of audience(percent)

The question is how much duplication should be required to declare a link.

Expected duplication vs. observed duplication

Our approach was to compare the observed duplication between two outlets to the "expected duplication" due to chance alone. Expected duplication was determined by multiplying the reach of each outlet. So, for example, if outlet A had a reach of 30% and outlet B a reach of 20%, then 6% of the total audience would be expected to have used each just by chance. If the observed duplication exceeded the expected duplication, a link between two outlets was declared present (1); if not, it was absent (0) (see Ksiazek, 2011, for a detailed treatment of this operationalization).

In other words, we can first build an attention flow network, then prune the edges whose observed duplication is smaller than the expected duplication, as in the code below.

# Build the attention flow network and prune edges by expected duplication.
# Assumes: df is a DataFrame with 'uid' and 'book_id' columns (one row per reading event),
# book_groups maps each book_id to the row indices of df for that book,
# and user_sum is the total number of users.
from collections import defaultdict

import numpy as np
import networkx as nx

def constructFlowNetwork(C):
    """Build a directed flow network from (uid, item) records sorted by user."""
    E = defaultdict(lambda: 0)
    E[('source', C[0][1])] += 1
    E[(C[-1][1], 'sink')] += 1
    for a, b in zip(C[:-1], C[1:]):
        if a[0] == b[0]:                 # same user: flow from a's item to b's item
            E[(a[1], b[1])] += 1
        else:                            # user changes: close one path, open another
            E[(a[1], 'sink')] += 1
            E[('source', b[1])] += 1
    G = nx.DiGraph()
    for (x, y), w in E.items():
        G.add_edge(x, y, weight=w)
    return G

d = np.array(df[['uid', 'book_id']])
G = constructFlowNetwork(d)

for edge in list(G.edges()):             # copy the edge list so edges can be removed safely
    if 'sink' not in edge and 'source' not in edge:
        observed_duplication = len(set(df['uid'].iloc[book_groups[edge[0]]])
                                   & set(df['uid'].iloc[book_groups[edge[1]]])) / float(user_sum)
        expected_duplication = (len(book_groups[edge[0]]) / float(user_sum)) \
                               * (len(book_groups[edge[1]]) / float(user_sum))
        if observed_duplication < expected_duplication:
            G.remove_edge(edge[0], edge[1])

Network Metrics

  • degree score: converted into percent, corresponding to nx.degree_centrality()

For each outlet, the number of links is totaled to provide a degree score. For ease of interpretation, we converted these totals to percentages. So, for example, if an outlet had links to all the other 235 outlets, its degree score was 100%. If it had links to 188 outlets, its degree score was 80%.

  • network centralization score, i.e. Freeman's (1979) degree centralization

To provide a summary measure across the entire network of outlets, we computed a network centralization score. This score summarizes the variability or inequality in the degree scores of all nodes in a given network (Monge & Contractor, 2003) and is roughly analogous to the HHI (see Hindman, 2009; Yim, 2003) that measures concentration in media-centric research. Network centralization scores range from 0% to 100%. In this application, a high score indicates that audiences tend to gravitate to a few outlets (concentration), whereas a low score indicates that audiences spread their attention widely across outlets (fragmentation).
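
A small sketch of both summary measures on a toy undirected network, assuming Freeman's degree-based centralization formula (networkx as the only dependency):

import networkx as nx

def degree_scores(G):
    """Degree score per outlet, as a percentage of the other outlets it links to."""
    n = G.number_of_nodes()
    return {v: 100.0 * G.degree(v) / (n - 1) for v in G.nodes()}

def freeman_centralization(G):
    """Freeman (1979) degree centralization: 0% = perfectly even, 100% = star-like."""
    n = G.number_of_nodes()
    cd = [G.degree(v) / float(n - 1) for v in G.nodes()]
    c_max = max(cd)
    return 100.0 * sum(c_max - c for c in cd) / (n - 2)

G = nx.gnp_random_graph(236, 0.9, seed=1)   # a dense toy stand-in for the 236-outlet network
scores = degree_scores(G)
print(min(scores.values()), max(scores.values()))
print(freeman_centralization(G))            # a low value means attention is evenly spread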

Result

The distribution shows that almost all 236 outlets have high levels of audience duplication with all other outlets (i.e., degree scores close to 100%). Furthermore, the network centralization score is 0.86%. This suggests a high level of equality in degree scores and thus evidence that the audience of any given outlet, popular or not, will overlap with other outlets at a similar level.

For instance, the Internet brand Spike Digital Entertainment reaches only 0.36% of the population, but its audience overlaps with close to 70% of the other outlets. Although we do not have data on individual media repertoires, these results suggest that repertoires, though quite varied, have many elements in common. The way users move across the media environment does not seem to produce highly polarized audiences.

The future of audience fragmentation

The myth of enclaves

“Long Tail forces and technologies that are leading to an explosion of variety and abundant choice in the content we consume are also tending to lead us into tribal eddies. When mass culture breaks apart it doesn’t re-form into a different mass. Instead, it turns into millions of microcultures.” Anderson, 2006

Our results indicate that, at least across the 236 outlets we examined, there are very high levels of audience overlap. The people who use any given TV channel or Web site are disproportionately represented in the audience for most other outlets.

All-in-all, there is very little evidence that the typical user spends long periods of time in niches or enclaves of like-minded speech. Alternatively, there is also little evidence that the typical user only consumes hits. Rather, most range widely across the media landscape, a pattern confirmed by the low network centralization score. They may appear in the audience of specialized outlets, but they do not stay long.

That said, neither media-centric nor audience-centric studies on fragmentation provide much evidence of a radical dismembering of society. Although Anderson (2006) can look at long tails and foresee ‘‘the rise of massively parallel culture’’, we doubt that interpretation. That suggests a profusion of media environments that never intersect.

It is more likely that we will have a massively overlapping culture. We think this for two reasons. First, there is growing evidence that despite an abundance of choice, media content tends to be replicated across platforms (e.g., Boczkowski, 2010; Jenkins, 2006; Pew, 2010). Second, while no two people will have identical media repertoires, the chances are they will have much in common. Those points of intersection will be the most popular cultural products, assuming, of course, that popular offerings persist.

The persistence of popularity

Will future audiences distribute themselves evenly across all media choices or will popular offerings continue to dominate the marketplace?

Anderson (2006, p. 181) expects that in a world of infinite choice, ‘‘hit-driven culture’’ will give way to ‘‘ultimate fragmentation.’’ Others believe that ‘‘winner-take-all’’ markets will continue to characterize cultural consumption (e.g., Elberse, 2008; Frank & Cook, 1995).

We are inclined to agree with the latter and offer three arguments why audiences are likely to remain concentrated in the digital media marketplace:

Differential quality of media products

The quality of media products is not uniformly distributed. If prices are not prohibitive, attendance will gravitate to higher quality choices.

First, the pure ‘‘public good’’ nature of digital media makes them easy to reproduce, and often ‘‘free’’ (Anderson, 2009). As Frank and Cook (1995, p. 33) noted

If the best performers’ efforts can be cloned at low marginal cost, there is less room in the market for lower ranked talents.

Second, the increased availability of "on-demand" media promotes this phenomenon. The move to digital video recorders and downloaded or streamed content makes it simple to avoid the less desirable offerings that were often bundled in linear delivery systems. Consuming a diet of only the best the market has to offer is easier than ever before. This effectively reduces the number of choices and concentrates attention on those options (think of recommendation systems).

The social desirability of media selections

Media have long served as a "coin-of-exchange" in social situations (Levy & Windahl, 1984). A few programs, sporting events, or clips on YouTube are the stuff of water-cooler conversations, which encourages those who want to join the discussion to see what everyone else is talking about.

The advent of social media, such as Facebook and Twitter, may well extend these conversations to virtual spaces and focus the attention of those networks on what they find noteworthy. Often this will be popular, event-driven programming.

Recent studies on simultaneous media use during the 2010 Super Bowl and opening ceremonies of the Winter Olympics suggest that individuals use social media to discuss these events as they watch TV (NielsenWire, 2010, February 12; 2010, February 19).

The media measures that inform user choices

Because digital media are abundant and the products involved are experience goods, users depend on recommendation systems to guide their consumption. Although search and recommendation algorithms vary, most direct attention to popular products or outlets (Webster, 2010).

The more salient that user information, the more markets are inclined to produce winner-take-all results, although the actual winners are impossible to predict before the process begins. Under such circumstances, the ‘‘wisdom of crowds’’ (Surowiecki, 2004) may not be a reliable measure of quality, but it concentrates public attention nonetheless.

Conclusion

The persistence of popularity, and the inclination of providers to imitate what is popular, suggests that audiences will not spin off in all directions. Although the ongoing production of media by professionals and amateurs alike will grow the long tail ever longer, that does not mean endless fragmentation. Most niche media will be doomed to obscurity and the few who pay a visit will spend little time there.

Rather, users will range widely across media outlets, devoting much of their attention to the most salient offerings. Those objects of public attention will undoubtedly be more varied than in the past. They will often, though not always, be the best of their kind. They will be the media people talk about with friends and share via social networks. Their visibility and meaning may vary across the culture, but they will constitute the stuff of a common, twenty-first-century cultural forum.

The graph_tool module provides a Graph class and several algorithms that operate on it. The internals of this class, and of most algorithms, are written in C++ for performance, using the Boost Graph Library.

Python modules are usually very easy to install, typically requiring nothing more than pip install for basically any operating system. For graph-tool, however, the situation is different. This is because, in reality, graph-tool is a C++ library wrapped in Python, and it has many C++ dependencies such as Boost, CGAL and expat, which are not installable via Python-only package management systems such as pip. Because the module lives between the C++ and Python worlds, its installation is done more like a C++ library rather than a typical Python module. This means it inherits some of the complexities common in the C++ world that some Python users do not expect.

The easiest way to get going is to use a package manager, for which the installation is fairly straightforward. This is the case for some GNU/Linux distributions (Arch, Gentoo, Debian & Ubuntu) as well as for MacOS users using either Macports or Homebrew.

Reference

Use brew to install graph-tool on MacOS

brew tap homebrew/science
brew install graph-tool
brew update
# brew update failed at first:
#   Error: /usr/local is not writable. You should change the ownership
#   and permissions of /usr/local back to your user account:
#     sudo chown -R $(whoami) /usr/local
sudo chown -R zhicongchen /usr/local
brew update
# After brew update:
#   ==> Migrated HOMEBREW_REPOSITORY to /usr/local/Homebrew!
#   Homebrew no longer needs to have ownership of /usr/local. If you wish you can
#   return /usr/local to its default ownership with:
#     sudo chown root:wheel /usr/local
sudo chown root:wheel /usr/local

brew installed another python into /usr/local/Cellar/python/2.7.13/bin/python, so we can open that python to use graph-tool.

# Run the Homebrew-installed Python:
/usr/local/Cellar/python/2.7.13/bin/python
# Then import graph-tool:
from graph_tool.all import *

It seems that graph-tool cannot yet be installed directly into Anaconda, as noted below:

You cannot really mix homebrew with anaconda without defeating the whole purpose of isolated environments. I would try to find a way to install graph-tools directly into your anaconda environment. It seems that there are packages on anaconda cloud. But I am not sure how easy it is to install those. – cel Dec 24 ‘15 at 7:50

http://stackoverflow.com/questions/34447563/force-home-brew-to-install-graph-tools-to-the-anaconda-python-interpretor

Question

There is an exponentially truncated power-law equation in the article below:

Gonzalez, M. C., Hidalgo, C. A., & Barabasi, A. L. (2008). Understanding individual human mobility patterns. Nature, 453(7196), 779-782.

like this:

P(r_g) = (r_g + r_g^0)^{-\beta_r} \exp(-r_g/\kappa)

It is an exponentially truncated power law. There are three parameters to be estimated: $r_g^0$, $\beta_r$ and $\kappa$. We have obtained several users' radii of gyration ($r_g$).

You can use the rg and prg data directly, as follows:

import numpy as np

rg = np.array([ 20.7863444 , 9.40547933, 8.70934714, 8.62690145,
7.16978087, 7.02575052, 6.45280959, 6.44755478,
5.16630287, 5.16092884, 5.15618737, 5.05610068,
4.87023561, 4.66753197, 4.41807645, 4.2635671 ,
3.54454372, 2.7087178 , 2.39016885, 1.9483156 ,
1.78393238, 1.75432688, 1.12789787, 1.02098332,
0.92653501, 0.32586582, 0.1514813 , 0.09722761,
0. , 0. ])
prg = np.array([ 0. , 0.03448276, 0.06896552, 0.10344828, 0.13793103,
0.17241379, 0.20689655, 0.24137931, 0.27586207, 0.31034483,
0.34482759, 0.37931034, 0.4137931 , 0.44827586, 0.48275862,
0.51724138, 0.55172414, 0.5862069 , 0.62068966, 0.65517241,
0.68965517, 0.72413793, 0.75862069, 0.79310345, 0.82758621,
0.86206897, 0.89655172, 0.93103448, 0.96551724, 1. ])

How can I use these rg data to estimate the three parameters above? I hope to solve it using Python.

Answer

Following @Michael's suggestion, we can solve the problem using scipy.optimize.curve_fit:

from scipy import optimize

def func(rg, rg0, beta, K):
    return (rg + rg0) ** (-beta) * np.exp(-rg / K)

# p0 is the initial guess for [rg0, beta, K]
popt, pcov = optimize.curve_fit(func, rg, prg, p0=[1.8, 0.15, 5])
print(popt)
print(pcov)

The results are given below:

[ 1.04303608e+03 3.02058550e-03 4.85784945e+00]
[[ 1.38243336e+18 -6.14278286e+11 -1.14784675e+11]
 [ -6.14278286e+11 2.72951900e+05 5.10040746e+04]
 [ -1.14784675e+11 5.10040746e+04 9.53072925e+03]]

Reference

scipy.optimize.curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, bounds=(-inf, inf), method=None, jac=None, **kwargs)

Use non-linear least squares to fit a function, f, to data.

Assumes ydata = f(xdata, *params) + eps

Examples
>>> import numpy as np
>>> from scipy.optimize import curve_fit
>>> def func(x, a, b, c):
... return a * np.exp(-b * x) + c
>>>
>>> xdata = np.linspace(0, 4, 50)
>>> y = func(xdata, 2.5, 1.3, 0.5)
>>> ydata = y + 0.2 * np.random.normal(size=len(xdata))
>>>
>>> popt, pcov = curve_fit(func, xdata, ydata)

To do data-driven social science research, one needs to carefully work through small data first, in order to develop intuition and insight for big data.

I should not spend too much time scaling up data processing. Whether with Python, shell scripts, Hadoop, or Spark, that is engineering work, and my direction now is research rather than engineering. What I need is to produce results, stories, and papers quickly, so I should work meticulously, dig out the stories, and at the same time keep a critical eye on big data itself.

Eighty-one days of data, more than 100 GB: an awkward data size. I have spent a lot of time fiddling with it yet kept circling the periphery, still knowing nothing about the data's own characteristics. At this rate, when will I ever truly understand it?
So I should start from a single day of data, see the large through the small, and build up from the bottom.

As a social scientist facing "big data", the core competitive advantage is still understanding, familiarity, and sensitivity toward the data, together with deep thinking about and insight into social questions.