Data Preparation

• No: row number
• year: year of data in this row
• month: month of data in this row
• day: day of data in this row
• hour: hour of data in this row
• pm2.5: PM2.5 concentration
• DEWP: Dew Point
• TEMP: Temperature
• PRES: Pressure
• cbwd: Combined wind direction
• Iws: Cumulated wind speed
• Is: Cumulated hours of snow
• Ir: Cumulated hours of rain

Introduction

Forked a simple poetry-writing bot written by an undergraduate classmate. The training data are 30,000 Tang poems scraped from the web, and the model is an LSTM (RNN), similar in principle to Word2Vec. For details, see his CSDN blog: http://blog.csdn.net/accepthjp/article/details/73875108

• Mac OS X
• Python 2.7 on Anaconda
• TensorFlow 1.0

Summary

https://github.com/zhicongchen/Chinese_poem_generator

http://nbviewer.jupyter.org/github/zhicongchen/datalab/blob/master/Getting%20Started%20With%20TensorFlow.ipynb

• Whether the data are images or text, they have strong locality; other kinds of data may not share this property.
• All the tutorials I have seen deal with text and images; can deep learning only do these?
There is a view that DL shines on text and images because they are well represented: text can be turned into vectors with word2vec, and images already are vectors.
• How can we build deep learning models on mobile phone data?

(Maybe) use node2vec, then take the learned vectors as features to predict the target variable.

• Try this with the Baidu Reading data: build an attention-flow network, obtain node vectors with node2vec or DeepWalk, and feed them to a deep learning model to predict book sales volume and revenue.
• The example I just posted shows that deep learning frameworks can do regression analysis; I am not sure whether neural network models such as CNNs can be used for this.

How well does node2vec actually work? If it were already perfect, why would so many people still be studying NRL (Network Representation Learning)?
Where exactly do the hard problems of NRL lie?
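As a sketch of the DeepWalk/node2vec idea discussed above (illustrative only: the toy graph and function names are mine), the walk-generation step needs only the standard library; the walks then act as "sentences" for a skip-gram model such as gensim's Word2Vec, whose vectors become node features for a downstream predictor:

```python
import random

def random_walks(adj, num_walks=10, walk_length=8, seed=42):
    """DeepWalk-style truncated random walks over a graph given as an
    adjacency dict {node: [neighbors]}.  The walks act as 'sentences'
    for a skip-gram model (e.g. gensim's Word2Vec); the learned node
    vectors can then serve as features for a downstream predictor."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:
                    break  # dead end: truncate the walk
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy attention-flow network
graph = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C"],
}
walks = random_walks(graph)
```

node2vec differs only in biasing the neighbor choice with return/in-out parameters; the surrounding pipeline (walks, skip-gram, regression on the vectors) is the same.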

AI & Public Policy, Tsinghua University & University of Chicago, July 22-23, 2017

7.22

Keynote Speech

Xue Lan

Public Policy at Tsinghua: wide-ranging research; the school was set up in 2000

Four major transformations in China since 1979

• Economic system: planning -> market
• Industrial structure: agriculture + manufacturing -> manufacturing
• Society: rural -> urban; closed -> open
• Governance system: charisma & authority -> efficiency

R&D Expenditures

• How good is the quality of Chinese publications, given how much their number has grown?
• How to measure it?
• How to explain the rise of Chinese publications in high quality journals?

use Excellence in Research for Australia journals as a criterion to select journals

• Sociology of knowledge (知识社会学)
• Science of Science - Individual, Citation
• Bibliometrics and information management (文献情报与信息管理) - Journal, Influence, Index

James Evans

Science as a complex system

• How does science reproduce?
• How does science evolve?
• How does science persist?
• How do fields ignite?

Through representation, everything is close to each other. It’s a small world after all.

Predicting Paper/Patents

Novel (improbable) outcomes: Novelty ~ 1/P(I|A, S)

Content Novelty & Context Novelty: 0 correlation

• content: people combining knowledge, pooling concepts
• context: the community you draw from

Science thinks like a global Bayesian; science does not think the way an individual scientist thinks.

What Is Science's Objective?

• Solving the world’s problems.
• Discovering what it discovers
• Transforming itself
• Generating robust, generalizable knowledge?

topic models - a mathematically equivalent way to represent a paper

Dashun Wang

dashunwang@gmail.com

Predictive Signals Behind Success

Using Social theories, combining mathematical methods
Just like the keynote on IC2S2, 2017

Q: Can success be measured, modeled and predicted?
the collective feature of success
You are successful because all the others think you are successful

Modeling Citation Dynamics: 3 factors

• Preferential Attachment
• Aging
• Intrinsic Novelty

Combine the 3 factors to model the probability of paper citation; it can be solved analytically
Rescaled Citation and Rescaled Time
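If the model sketched here is the one from Wang, Song and Barabási's "Quantifying long-term scientific impact" (Science, 2013), which the three factors and the rescaled variables suggest (an assumption on my part), the three factors combine into a citation rate whose analytic solution for the cumulative citations of paper $i$ at time $t$ after publication is:

$$c_i^t = m\left[\exp\left(\lambda_i\,\Phi\!\left(\frac{\ln t - \mu_i}{\sigma_i}\right)\right) - 1\right]$$

where $\lambda_i$ is the paper's relative fitness (novelty), $\Phi$ is the cumulative normal distribution capturing lognormal aging with immediacy $\mu_i$ and longevity $\sigma_i$, and $m$ is a global constant; rescaling $t$ and $c_i^t$ by these parameters collapses all citation histories onto a single universal curve.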

Quantifying the evolution of individual scientific impact

Will a scientist produce higher impact work following a major discovery? Hope.

The rate of hits is high between 0 and 20 years and decays afterwards? Actually, it is random! It decays only because their overall publication rate decays. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

Hot hand phenomenon in artistic, cultural and scientific careers. Biggest vs. second-biggest hit.

What is the diffusion of innovation about? It is not about adoption, but about substitution. What does substitution look like? Exponential growth / logistic growth

• Handsets. Impact: number of handsets sold; each handset has its own exponential parameters
• Automobiles,

A power law grows much more slowly than an exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, every day over 10 years, ~9,000 handset models

Metric: Substitution Probability, determined by 3 factors (model)

• Preferential Return
• Recency
• Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Taken together

• Impact grows as a power law with non-integer exponents
• By exploring large-scale datasets, we find three mechanisms governing substitution patterns
• We derive a Minimal Substitution model, allowing us not only to predict the observed growth pattern but also to collapse impact trajectories onto one universal curve
• The Minimal Substitution model predicts an intriguing connection between short term impact and long term impact

To finish this work this summer, I hope.

Oil discovery: 10% -> 60% over 75 years.
Advances in geological theory provided innovators with more data to uncover the fundamental mechanisms behind it.

Question:

• Why these 3 factors?
• Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
It is the minimal model we can have according to our citation data. After all, we can make sure that the 3 mechanisms work.
Only through curve fitting can we find the minimal result.
Zhao Hongzhou (赵洪洲)

Session 1

Lingfei Wu

Team Science
Small teams create problems and grow attention into the future; big teams solve them and harvest. Big teams chase the successful works of small teams

Sleeping Beauty Index - PNAS

Dongbo Shi

Funding and Scientific Research: National Science Fund for Distinguished Young Scholars

Yian Yin

The Nature of Repeated Failures

Data has to ‘outlive’ individual careers, NIH datasets

alpha - stiffness, use alpha to build up a model

Each failure-success sequence is a cycle

Tao Jia

School of Computer Science, Southwest University

Probing Behavior of Scientists

Quantifying patterns of research-interest evolution, Nature Human Behaviour

We are what we repeatedly do. - Aristotle

Big Data -> Activities -> Features -> User Profile
Three Features:

• Heterogeneity: topic tuple usage in an individual’s career follows a power-law distribution
• Recency: An individual is more likely to publish on research subjects studied recently
• Subject Proximity

Model: Scientific research is like a random walk
To what degree could these patterns be captured by a simple statistical model?

MeiJun Liu

Faculty of Education, University of Hong Kong
Age and team of great scientific discoveries in China
On-going work, only some figures presented

Session 2

YongRen Shi

Bots improve human coordination in network experiments
Amazon Mechanical Turk: Online Labor Market - Game on Network - Quantitative Data. breadboard.yale.edu

How can bots accelerate the coordination process?
Every player chooses the locally best color, yet the problem still goes unsolved

Kevin Gao

Microsoft Research, NYC, @hb123boy

Conducting human subjects experiments in the ‘virtual lab’
Computational Social Science

In the 1950s, people were put in an experimental room to be tested.

Virtual Lab: Bring the lab closer to the real world, using the Internet as a lab

• Complexity Realism
• Duration, Participation
• Size, Scale

TurkServer, built on Meteor web app framework: https://github.com/TurkServer/turkserver-meteor, Crowd Mapper, Andrew Mao, Winter Mason, Siddharth Suri, Duncan Watts

Intertemporal Choice, Kevin Gao, Dan Goldstein

Long-run Cooperation, a very long prisoner’s dilemma experiment, Andrew Mao…Duncan Watts

Han Zhang

PhD candidate at Princeton University, collaborating with Jennifer Pan

Identifying protests from social media data, using deep learning techniques

Training datasets: The Lu dataset, collected by Chinese lawyer Yuyu Lu, from blogspot

A hard task: the texts are short and the meanings are tricky

• Text: RNN(LSTM)
• Image: 4-layer-CNN

YuanHao Liu

Officer mobility in China: what factors influence it?

Data Source: Prof. Zhou Xueguang

Factor-based to agent-based - causal inference to sufficient conditions. Fractal network.

Logic -> Structure: rich-get-richer and hub repulsion
Not a Markov process, not a random walk. An efficient way to fill a space: the 3/4 law

7.23

Xingyuan Yuan

RNN - LSTM

• Nowcasting
• Machine Translation
• Music Composer
No music theory, Representation, loss function, sequence2sequence

Yan Xia

Deep learning in autonomous driving, Momenta

Industrial Thinking: Technology must go first.

The only successful way in industry - the supervised way
Big Data

• Public: ImageNet
• Blooming of the Internet

Big Computation

• GPUs

Software and Infrastructure (data storage)

• Git, AWS, Amazon Mechanical Turk (for labeling)

Faster R-CNN, arxiv.org/abs/1506.01497; Fully Convolutional Networks for Semantic Segmentation

Jiang Zhang

Physicists: Your work is so ugly! There are too many parameters in your model.

Map of Complexity Science, by Brian Castellani

Complexity - AI

Why bother with a neural network?

• It is a good predictor
• It can extract features automatically

Deep Learning fights poverty, Science, remote sensing data

Feature extraction

Use a CNN (as feature extractor) to train a model that predicts night-time lightness (already-labeled data), then concatenate these features (the first several layers of the model) into another model that predicts poverty (transfer learning)

Complex network classifier

Use a neural network to classify complex networks (small-world vs. scale-free): a network representation problem. Images are the easiest to encode (represent); text has likewise been solved by word2vec.

DeepWalk algorithm - uses random walks to generate sequences
What the CNN learns - 2 filters

How to recognize a network without links - DeepWalk can encode the link information into coordinates.

Deep learning can be used to solve complex network problems. Can a DNN become an expert in complex networks?

Yizhuang You

Hyperbolic Network, Boltzmann Machine and Holographic Duality
Popularity vs. Similarity
~ renormalization - field theory
Boltzmann Machine, Hyperbolic Space

Lei Dong

Data-driven urban studies, combining computer science and economics

Former Data Scientist in Baidu, Data Science Company - QuantUrban

What can we do with mobile phone data?

• Mobile Phone Data and Urban Dynamics, Real-time dynamics
• Day and Night Population Distribution
• Mapping Home-Work Connections with Machine Learning, Baidu Maps (百度地图) "frequent places" feature, rule-based, labels
• Commuting Data and Visualization
• Community Detection and City Boundary
• Population Migration During Spring Festival
• Spatial-temporal Behaviors and Economics
• Mobile Internet Coverage and Poverty

Toolkits for social scientists

• Spider - system, dashboard
• Mobile Turk - label data

The video on IC2S2

How can Big Data help us understand human behavior, social networks, and success?

Success is about a person's future impact; in other words, the only question in our minds is whether he will go up or down in his career. In the past we relied on qualitative judgement; now we can turn to massive datasets. So let us try to find a quantitative way to study the formation of success. Can success be measured, modeled and predicted, just as we have done with natural phenomena?

Science of Science

• Robert Merton: Matthew Effect, Singletons and multiples
• Harriet Zuckerman: Scientific Elites
• Derek de Solla Price: Invisible College, Power law, Cumulative advantage

Q: Can success be measured, modeled and predicted?

• the collective feature of success
• You are successful because all the others think you are successful

Modeling Citation Dynamics

Three generic factors

• Preferential Attachment
• Aging
• Intrinsic Novelty

Combine the 3 factors to model the probability of paper citation; it can be solved analytically (Rescaled Citation and Rescaled Time)

Quantifying the evolution of individual scientific impact

Will a scientist produce higher impact work following a major discovery? Hope.

The rate of hits is high between 0 and 20 years and decays afterwards? Actually, it is random! It decays only because their overall publication rate decays. Method: break up the timeline and choose a middle position to observe.

What happens after your biggest hit? Winning begets more winning.

The citation dynamics of paper i follows three parameters

• Fitness
• Immediacy
• Longevity

Hot hand phenomenon in artistic, cultural and scientific careers. Biggest vs. second-biggest hit.

What does substitution look like? Exponential growth / logistic growth

• Handsets. Impact: number of handsets sold; each handset has its own exponential parameters
• Automobiles,

A power law grows much more slowly than an exponential. What mechanisms are responsible for the observed non-analytic growth?

Understanding Patterns of Substitutions

A country-wide mobile phone dataset: 3.25M users, every day over 10 years, ~9,000 handset models

Metric: Substitution Probability, determined by 3 factors (model)

• Preferential Return
• Recency
• Propensity

3 different systems, determined by the same 3 factors(mechanisms)

Substitution Patterns(Three parameters)

• Anticipation
• Fitness
• Longevity

Taken together

• Impact grows as a power law with non-integer exponents
• By exploring large-scale datasets, we find three mechanisms governing substitution patterns
• We derive a Minimal Substitution model, allowing us not only to predict the observed growth pattern but also to collapse impact trajectories onto one universal curve
• The Minimal Substitution model predicts an intriguing connection between short term impact and long term impact

To finish this work this summer, I hope.

A story about 10%: oil discovery rose from 10% to 60% over 75 years, driven by improvements in geological theory that provided innovators with more data to uncover the fundamental mechanisms behind it.

Question:

• Why these 3 factors?
• Given 3 factors, how did you build the model in that way? How did you evaluate that it works best?
It is the minimal model we can have according to our citation data. After all, we can make sure that the 3 mechanisms work.
Only through curve fitting can we find the minimal result.

Reference

• Sinatra, Wang, Deville and Barabasi, Science, 2016
• Liu, Wang, Giles, Sinatra, Song and Wang, 2017
• Jin, Song, Bjelland, Canright and Wang, 2017

Song, C., Qu, Z., Blumm, N., & Barabási, A. L. (2010). Limits of predictability in human mobility. Science, 327(5968), 1018-1021.

Indeed, although we rarely perceive any of our actions to be random, from the perspective of an outside observer who is unaware of our motivations and schedule, our activity pattern can easily appear random and unpredictable.

Background

At present, the most detailed information on human mobility across a large segment of the population is collected by mobile phone carriers

We assign three entropy measures to each individual’s mobility pattern

• the random entropy
• the temporal uncorrelated entropy
• the actual entropy

removed the 5,000 users with the highest q from our data set, which ensured that all remaining 45,000 users satisfied q < 0.8.

Data

D1: This anonymized data set represents 14 weeks of call patterns from 10 million mobile phone users (roughly April through June 2007). The data contains the routing tower location each time a user initiates or receives a call or text message. From this information, a user’s trajectory may be reconstructed.

For each user $i$ we define the calling frequency $f_i$ as the average number of calls per hour, and the number of locations $N_i$ as the number of distinct towers visited during the three-month period.
In order to improve the quality of trajectory reconstruction, we selected 50,000 users with $f_i \geq 0.5$ calls/hour and $N_i > 2$.

D2: Mobile services such as pollen and traffic forecasts rely on the approximate knowledge of the customer's location at all times. For customers voluntarily enrolled in such services, the date, time and the closest tower coordinates are recorded on a regular basis, independent of phone usage. We were provided with the anonymized records of 1,000 such users, from which we selected 100 users whose coordinates were recorded every hour over eight days.

Metrics

Determined $S_i$, $S_i^{unc}$, and $S_i^{rand}$ for each user $i$

True Entropy

Entropy rate: https://en.wikipedia.org/wiki/Entropy_rate

Elements of Information Theory

Lempel-Ziv Compression Algorithm: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

http://rosettacode.org/wiki/LZW_compression#Python
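Since the exact entropy of a finite trajectory is not computable, the paper's "actual entropy" is obtained with a Lempel-Ziv-style estimator. A minimal stdlib sketch of such an estimator (my reconstruction following the Kontoyiannis et al. form; the function name is mine, and a real location history would first be mapped to a string of tower IDs):

```python
import math

def lz_entropy_rate(seq):
    """Lempel-Ziv estimate of the entropy rate of a symbol sequence,
    in bits per symbol.  Lambda_i is the length of the shortest
    substring starting at position i that does not appear anywhere in
    the prefix seq[:i]; the estimate is n * log2(n) / sum(Lambda_i)."""
    n = len(seq)
    total = 0
    for i in range(n):
        k = 1
        # grow the window until seq[i:i+k] no longer occurs in the prefix
        while i + k <= n and seq[i:i + k] in seq[:i]:
            k += 1
        total += k
    return n * math.log2(n) / total
```

A strongly periodic home-work-home sequence yields a low estimate, while a sequence of many distinct locations yields a high one.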

This quantity is subject to Fano’s inequality (24, 26).
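Spelled out (my reconstruction of the standard form of the bound used in this setting), Fano's inequality constrains the maximal predictability $\Pi_{\max}$ of a user with entropy $S$ who visits $N$ distinct locations:

$$S \le H(\Pi_{\max}) + (1 - \Pi_{\max})\log_2(N - 1), \qquad H(p) = -p\log_2 p - (1 - p)\log_2(1 - p)$$

solving the corresponding equality for $\Pi_{\max}$ gives the upper bound on predictability reported in the Results below.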

Regularity

We measured each user's regularity R, defined as the probability of finding the user in his most visited location during a given hour.

Results

For a user with $\Pi_{max} = 0.2$, this means that at least 80% of the time the individual chooses his location in a manner that appears to be random, and only in the remaining 20% of the time can we hope to predict his or her whereabouts.

To our surprise, we found that $P(\Pi_{max})$ does not follow the fat-tailed distribution suggested by the travel distances, but is narrowly peaked near $\Pi_{max} \approx 0.93$ (Fig. 2B).

To reconcile the wide variability in the observed travel distances, we measured the dependency of $\Pi_{max}$ on $r_g$,

To determine how much of our predictability is really rooted in the visitation patterns of the top locations, we calculated the probability $\tilde{\Pi}$ that, in a given moment, the user is in one of the top n most visited locations, where n = 2 typically captures home and work.

Conclusion

It is not unreasonable to expect, therefore, that predictability should also vary widely: For people who travel little, it should be easier to foresee their location, whereas those who regularly cover hundreds of kilometers should have a low predictability. Despite this inherent population heterogeneity, the maximal predictability varies very little; indeed, $P(\Pi_{max})$ is narrowly peaked at 93%, and we see no users whose predictability would be under 80%.
Although making explicit predictions on user whereabouts is beyond our goals here, appropriate data-mining algorithms (19, 20, 27) could turn the predictability identified in our study into actual mobility predictions.

With the help of Python, we extracted a list of the user's locations over one month, each weighted by its count, presented below:

Create a new HTML file based on the template below, substituting in the data.

http://developer.baidu.com/map/jsdemo.htm#c1_15

Specifically, the user's key should be your own key registered with BaiduMap.

Open the html file in Chrome, and you can get:

Reference

https://zhuanlan.zhihu.com/p/25845538

Luo, S. et al. Inferring personal economic status from social network location. Nat. Commun. 8, 15227; doi: 10.1038/ncomms15227 (2017).

Data

The social network is constructed from mobile (call and SMS metadata) and residential communications data, in other words CDRs (Call Detail Records), collected over a period of 122 days in a Latin American country.

The financial dataset from a major bank in the same country was collected during the same time period as the mobile dataset. The dataset consists of records of the bank clients’ age, gender, credit score, total transaction amount during each billing period, credit limit of each credit card, balance of cards (including debit and credit), zip code of billing address, and encrypted registered phone number.

Metrics

Collective Influence (CI) is an algorithm to identify the most influential nodes via optimal percolation.

CI minimizes the largest eigenvalue of a modified non-backtracking matrix of the network in order to find the minimal set of nodes to disintegrate the network.

CI has advantages in resolution, correlation with wealth, and scalability to massively large social networks.

CI is a concept proposed by

Morone, F. & Makse, H. A. Influence maximization in complex networks through optimal percolation. Nature 524, 65–68 (2015).

Results

Communication Patterns vs. Economic Status

It is visually apparent that the top 1% (accounting for 45.2% of the total credit in the country) displays a completely different pattern of communication than the bottom 10%; the former is characterized by more active and diverse links, especially connecting remote locations and communicating with other equally affluent people.

The wealthiest 1-percenters have higher diversity in mobile contacts and are centrally located, surrounded by other highly connected people (network hubs). On the other hand, the poorest individuals have low contact diversity and are weakly connected to fewer hubs.

Fraction of Wealthy Individuals vs. Age and Network Metrics

Correlation between the fraction of wealthy individuals versus age and (a) degree k (R2 = 0.92), (b) k-shell (R2 = 0.96), (c) PageRank (R2 = 0.96) and (d) log10CI (R2 = 0.93).

Further correlations are studied in Supplementary Note 6, indicating that CI could be considered as the most convenient metric out of the four due to its high resolution.

When we combine age and CI quantile ranking into an age-network composite: $ANC = \alpha Age / (1 - \alpha) CI$, with $\alpha = 0.5$, a remarkable correlation (R2 = 0.99, Fig. 3c) is achieved.

• Wei, R. (2014). Texting, tweeting, and talking: Effects of smartphone use on engagement in civic discourse in China. Mobile Media & Communication, 2(1), 3-19.
• Wei, R., Lo, V. H., Xu, X., Chen, Y. N. K., & Zhang, G. (2014). Predicting mobile news use among college students: The role of press freedom in four Asian cities. New Media & Society, 16(4), 637-654.

http://smd.sjtu.edu.cn/teacher/detail/id/23

Ran Wei, PhD, is the Gonzales Brothers Professor of Journalism in the School of Journalism & Mass Communications at the University of South Carolina, USA. A former TV journalist, active media consultant, and incoming Editor-in-Chief of Mass Communication & Society, his research focuses on media effects in society and digital new media, including wireless computing and mobile media.

https://www.sc.edu/study/colleges_schools/cic/faculty-staff/wei_ran.php

Webster, J. G., & Ksiazek, T. B. (2012). The dynamics of audience fragmentation: Public attention in an age of digital media. Journal of Communication, 62(1), 39-56.

Abstract

Audience fragmentation is often taken as evidence of social polarization. We offer a theoretical framework for understanding fragmentation and advocate for more audience-centric studies. We find extremely high levels of audience duplication across 236 media outlets, suggesting overlapping patterns of public attention rather than isolated groups of audience loyalists.

Three factors that shape fragmentation

Media Providers

The most obvious cause of fragmentation is a steady growth in the number of media outlets and products competing for public attention.

Media Users

What media users do with all those resources is another matter. Most theorists expect them to choose the media products they prefer. Those preferences might reflect user needs, moods, attitudes, or tastes, but their actions are ‘‘rational’’ in the sense that they serve those psychological predispositions.

Media Measures

Media measures exercise a powerful influence on what users ultimately consume and how providers adapt to and manage those shifting patterns of attendance. Indeed, information regimes can themselves promote or mitigate processes of audience fragmentation

Three different ways of studying fragmentation

Media-centric fragmentation

An increasingly popular way to represent media-centric data is to show them in the form of a long tail (Anderson, 2006).

Concentration can be summarized with any one of several statistics, including Herfindahl–Hirschman indices (HHIs) and Gini coefficients (see Hindman, 2009; Yim, 2003).

Herfindahl–Hirschman indices (HHIs)

The Herfindahl index (also known as Herfindahl–Hirschman Index, or HHI) is a measure of the size of firms in relation to the industry and an indicator of the amount of competition among them. It is defined as the sum of the squares of the market shares of the firms within the industry (sometimes limited to the 50 largest firms), where the market shares are expressed as fractions. The result is proportional to the average market share, weighted by market share. As such, it can range from 0 to 1.0, moving from a huge number of very small firms to a single monopolistic producer.

• $X_i$ - size of firm $i$
• $X$ - total market size
• $S_i$ - market share
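In code, using the symbols from the list above (a minimal sketch; the function name is mine):

```python
def hhi(sizes):
    """Herfindahl-Hirschman Index: the sum of squared market shares
    s_i = X_i / X expressed as fractions, so the result lies in (0, 1];
    1.0 corresponds to a single monopolistic producer."""
    total = sum(sizes)
    return sum((x / total) ** 2 for x in sizes)

hhi([50, 30, 20])  # 0.5^2 + 0.3^2 + 0.2^2 = 0.38
```

Note that the HHIs reported below (144.17, 173.14) are on the 0-10,000 scale used when shares are expressed in percentage points; multiply the fractional value by 10,000 to compare.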

Gini Coefficients

The Gini coefficient is a measure of statistical dispersion intended to represent the income or wealth distribution of a nation’s residents, and is the most commonly used measure of inequality. A Gini coefficient of zero expresses perfect equality, where all values are the same (for example, where everyone has the same income). A Gini coefficient of 1 (or 100%) expresses maximal inequality among values (e.g., for a large number of people, where only one person has all the income or consumption, and all others have none, the Gini coefficient will be very nearly one).

if $x_i$ is the wealth or income of person $i$, and there are $n$ persons, then the Gini coefficient $G$ is given by:

$$G = \frac{\sum_{i=1}^{n}\sum_{j=1}^{n}|x_i - x_j|}{2n\sum_{i=1}^{n}x_i}$$
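A direct stdlib translation of the mean-absolute-difference form of the Gini coefficient (function name mine):

```python
def gini(values):
    """Gini coefficient G = sum_ij |x_i - x_j| / (2 * n * sum_i x_i):
    0.0 for perfect equality, approaching 1.0 when one person holds
    everything."""
    n = len(values)
    total = sum(values)
    if total == 0:
        return 0.0
    mad = sum(abs(a - b) for a in values for b in values)
    return mad / (2 * n * total)

gini([0, 0, 0, 1])  # one person holds all: (n - 1) / n = 0.75
```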

Results

In Figure 1, the drop-off in cable network attendance is not precipitous, producing an HHI of 144.17, which suggests a modest level of overall concentration.

The HHI for Figure 2 is 173.14, indicating that the use of Internet brands is more concentrated than the use of television channels. Typically, audiences in less abundant media, such as radio and television, are more evenly distributed across outlets (i.e., fragmented) than in media with many choices, such as the Internet (Hindman, 2009; Yim, 2003)

User-centric fragmentation

This approach focuses on each individual's use of media: fragmentation at the micro level. Most of the literature on selective exposure suggests that people become specialized in their patterns of consumption, forming what are called "media repertoires" or "channel repertoires". Most studies focus on explaining the absolute size of repertoires, but often say little about their composition.

A user-centric approach has the potential to tell us what a typical user encounters over some period of time, but such studies rarely "scale up" to the larger issue of how the public allocates its attention across media.

Audience-centric fragmentation

A useful complement to the media- and user-centric approaches described above would be an "audience-centric" approach. This hybrid approach is media-centric in the sense that it describes the audience for particular media outlets. It is user-centric in that it reflects the varied repertoires of audience members, which are aggregated into measures that summarize each audience.

A network analytic approach to fragmentation

How to Build Network

The enlarged portion shows the link (i.e., the level of duplication) between a pair of nodes, NBC Affiliates and the Yahoo! brand, where 48.9% of the audience watched NBC and also visited a Yahoo! Web site during March 2009.

• Node - media
• Edge - audience duplication (percent)

The question is: how much duplication should be required to declare a link?

Expected duplication vs. observed duplication

Our approach was to compare the observed duplication between two outlets to the ''expected duplication'' due to chance alone. Expected duplication was determined by multiplying the reach of each outlet. So, for example, if outlet A had a reach of 30% and outlet B a reach of 20%, then 6% of the total audience would be expected to have used each just by chance. If the observed duplication exceeded the expected duplication, a link between two outlets was declared present (1); if not, it was absent (0) (see Ksiazek, 2011, for a detailed treatment of this operationalization).

In other words, we can first build an attention flow network and then prune the edges whose observed duplication does not exceed the expected duplication.
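The link rule can be sketched directly from the description above (function and variable names are mine; the 30%/20% numbers follow the text's example):

```python
def duplication_network(reach, observed):
    """Declare a link (1) between two outlets when their observed
    audience duplication exceeds the duplication expected by chance
    alone, i.e. the product of the two outlets' reach.

    reach:    {outlet: fraction of the population reached}
    observed: {(a, b): observed duplication fraction}
    """
    links = {}
    for (a, b), dup in observed.items():
        expected = reach[a] * reach[b]  # chance-level duplication
        links[(a, b)] = 1 if dup > expected else 0
    return links

# The article's example: reach of 30% and 20% -> 6% expected by chance,
# so an observed duplication of 10% is declared a link.
reach = {"A": 0.30, "B": 0.20}
links = duplication_network(reach, {("A", "B"): 0.10})
```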

Network Metrics

• degree score: converted into percent, corresponding to nx.degree_centrality()

For each outlet, the number of links is totaled to provide a degree score. For ease of interpretation, we converted these totals to percentages. So, for example, if an outlet had links to all the other 235 outlets, its degree score was 100%. If it had links to 188 outlets, its degree score was 80%.

• network centralization score: Freeman's (1979) degree-based centralization

To provide a summary measure across the entire network of outlets, we computed a network centralization score. This score summarizes the variability or inequality in the degree scores of all nodes in a given network (Monge & Contractor, 2003) and is roughly analogous to the HHI (see Hindman, 2009; Yim, 2003) that measures concentration in media-centric research. Network centralization scores range from 0% to 100%. In this application, a high score indicates that audiences tend to gravitate to a few outlets (concentration), whereas a low score indicates that audiences spread their attention widely across outlets (fragmentation).
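Both metrics can be sketched with the standard library (the toy network and function names are mine; the centralization here is the degree-based Freeman formula, matching the definition over degree scores given above):

```python
def degree_scores(adj):
    """Per-outlet degree score: the share of possible links (n - 1)
    that the outlet actually has, expressed as a percentage."""
    n = len(adj)
    return {v: 100.0 * len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def degree_centralization(adj):
    """Freeman-style degree centralization: sum(d_max - d_i) over all
    nodes, normalized by its maximum (n - 1) * (n - 2).  1.0 for a
    star (all attention on one hub), 0.0 when all degrees are equal."""
    degrees = [len(nbrs) for nbrs in adj.values()]
    n = len(degrees)
    d_max = max(degrees)
    return sum(d_max - d for d in degrees) / ((n - 1) * (n - 2))

# Toy star network: the hub has a 100% degree score and the
# centralization is maximal (concentration of attention).
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
```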

Result

The distribution shows that almost all 236 outlets have high levels of audience duplication with all other outlets (i.e., degree scores close to 100%). Furthermore, the network centralization score is 0.86%. This suggests a high level of equality in degree scores, and thus evidence that the audience of any given outlet, popular or not, will overlap with other outlets at a similar level.

For instance, the Internet brand Spike Digital Entertainment reaches only 0.36% of the population, but its audience overlaps with close to 70% of the other outlets. Although we do not have data on individual media repertoires, these results suggest that repertoires, though quite varied, have many elements in common. The way users move across the media environment does not seem to produce highly polarized audiences.

The future of audience fragmentation

The myth of enclaves

“Long Tail forces and technologies that are leading to an explosion of variety and abundant choice in the content we consume are also tending to lead us into tribal eddies. When mass culture breaks apart it doesn’t re-form into a different mass. Instead, it turns into millions of microcultures.” Anderson, 2006

Our results indicate that, at least across the 236 outlets we examined, there are very high levels of audience overlap. The people who use any given TV channel or Web site are disproportionately represented in the audience for most other outlets.

All-in-all, there is very little evidence that the typical user spends long periods of time in niches or enclaves of like-minded speech. Alternatively, there is also little evidence that the typical user only consumes hits. Rather, most range widely across the media landscape, a pattern confirmed by the low network centralization score. They may appear in the audience of specialized outlets, but they do not stay long.

That said, neither media-centric nor audience-centric studies on fragmentation provide much evidence of a radical dismembering of society. Although Anderson (2006) can look at long tails and foresee ‘‘the rise of massively parallel culture’’, we doubt that interpretation. That suggests a profusion of media environments that never intersect.

It is more likely that we will have a massively overlapping culture. We think this for two reasons. First, there is growing evidence that despite an abundance of choice, media content tends to be replicated across platforms (e.g., Boczkowski, 2010; Jenkins, 2006; Pew, 2010). Second, while no two people will have identical media repertoires, the chances are they will have much in common. Those points of intersection will be the most popular cultural products, assuming, of course, that popular offerings persist.

The persistence of popularity

Will future audiences distribute themselves evenly across all media choices or will popular offerings continue to dominate the marketplace?

Anderson (2006, p. 181) expects that in a world of infinite choice, ‘‘hit-driven culture’’ will give way to ‘‘ultimate fragmentation.’’ Others believe that ‘‘winner-take-all’’ markets will continue to characterize cultural consumption (e.g., Elberse, 2008; Frank & Cook, 1995).

We are inclined to agree with the latter and offer three arguments why audiences are likely to remain concentrated in the digital media marketplace:

Differential quality of media products

The quality of media products is not uniformly distributed. If prices are not prohibitive, attendance will gravitate to higher quality choices.

First, the pure ‘‘public good’’ nature of digital media makes them easy to reproduce, and often ‘‘free’’ (Anderson, 2009). As Frank and Cook (1995, p. 33) noted

If the best performers’ efforts can be cloned at low marginal cost, there is less room in the market for lower ranked talents.

Second, the increased availability of ''on-demand'' media promotes this phenomenon. The move to digital video recorders and downloaded or streamed content makes it simple to avoid the less desirable offerings that were often bundled into linear delivery systems. Consuming a diet of only the best the market has to offer is easier than ever before. This effectively reduces the number of choices and concentrates attention on those options (recommendation systems).

The social desirability of media selections

Media have long served as a ''coin of exchange'' in social situations (Levy & Windahl, 1984). A few programs, sporting events, or clips on YouTube are the stuff of water-cooler conversations, which encourages those who want to join the discussion to see what everyone else is talking about.

The advent of social media, such as Facebook and Twitter, may well extend these conversations to virtual spaces and focus the attention of those networks on what they find noteworthy. Often this will be popular, event-driven programming.

Recent studies on simultaneous media use during the 2010 Super Bowl and opening ceremonies of the Winter Olympics suggest that individuals use social media to discuss these events as they watch TV (NielsenWire, 2010, February 12; 2010, February 19).

The media measures that inform user choices

Because digital media are abundant and the products involved are experience goods, users depend on recommendation systems to guide their consumption. Although search and recommendation algorithms vary, most direct attention to popular products or outlets (Webster, 2010).

The more salient that user information, the more markets are inclined to produce winner-take-all results, although the actual winners are impossible to predict before the process begins. Under such circumstances, the ‘‘wisdom of crowds’’ (Surowiecki, 2004) may not be a reliable measure of quality, but it concentrates public attention nonetheless.

Conclusion

The persistence of popularity, and the inclination of providers to imitate what is popular, suggests that audiences will not spin off in all directions. Although the ongoing production of media by professionals and amateurs alike will grow the long tail ever longer, that does not mean endless fragmentation. Most niche media will be doomed to obscurity and the few who pay a visit will spend little time there.

Rather, users will range widely across media outlets, devoting much of their attention to the most salient offerings. Those objects of public attention will undoubtedly be more varied than in the past. They will often, though not always, be the best of their kind. They will be the media people talk about with friends and share via social networks. Their visibility and meaning may vary across the culture, but they will constitute the stuff of a common, twenty-first-century cultural forum.