Computational Social Science

David Lazer1, Alex Pentland2, Lada Adamic3, Sinan Aral2,4, Albert-László Barabási5, Devon Brewer6, Nicholas Christakis1, Noshir Contractor7, James Fowler8, Myron Gutmann3, Tony Jebara9, Gary King1, Michael Macy10, Deb Roy2, Marshall Van Alstyne2,11

  1. Harvard University, Cambridge, MA, USA. E-mail: david_lazer@harvard.edu
  2. Massachusetts Institute of Technology, Cambridge, MA, USA.
  3. University of Michigan, Ann Arbor, MI, USA.
  4. New York University, New York, NY, USA.
  5. Northeastern University, Boston, MA, USA.
  6. Interdisciplinary Scientific Research, Seattle, WA, USA.
  7. Northwestern University, Evanston, IL, USA.
  8. University of California-San Diego, La Jolla, CA, USA.
  9. Columbia University, New York, NY, USA
  10. Cornell University, Ithaca, NY, USA.
  11. Boston University, Boston, MA, USA.

Original article: http://blog.sciencenet.cn/home.php?mod=space&uid=64458&do=blog&id=229840

Translated by: Xu Xiaoke (xiaokeeie@gmail.com)

We live in networks. We check e-mail regularly, make mobile phone calls from almost anywhere, swipe transit cards to ride public transportation, and buy goods with credit cards. In public spaces our behavior may be recorded by surveillance cameras, and in hospitals our medical records are kept in digital form. Many of us also blog for anyone to read and maintain friendships through online social networks. Every one of these activities leaves a digital footprint, and together these traces form a complex picture of individual and collective behavior, with the potential to transform our understanding of our lives, organizations, and societies.

The capacity to collect and analyze massive amounts of data has already transformed fields such as biology and physics, yet a data-driven "computational social science" has been slow to develop. Leading journals in economics, sociology, and political science pay little attention to it, but computational social science is already being pursued inside Internet companies such as Google and Yahoo and in government agencies such as the US National Security Agency. Computational social science could become the exclusive domain of private companies and government agencies; alternatively, a privileged set of researchers might publish papers based on proprietary data that others cannot evaluate or replicate. Neither scenario serves the long-term public interest in the accumulation, verification, and dissemination of knowledge.

What is the value of a computational social science based in an open academic environment? Could it enhance society's understanding of individual and collective behavior? What are the obstacles to its development?

To date, research on human interactions has relied mainly on one-time, self-reported data. New technologies such as video surveillance (1), e-mail, and "smart" name badges offer not only interactions at different moments over time but also information about both the structure and the content of relationships. For example, group interactions can be studied with e-mail data, making it possible to ask how communication dynamics evolve: do work groups settle down and change little, or do their relationships shift dramatically over time (2)? Which interaction patterns correspond to productive groups and individuals (3)? Face-to-face group interactions can be assessed with "sociometers", electronic devices that people wear and that continuously capture physical proximity, location, movement, and other facets of individual behavior and collective interaction. Such data can address many interesting questions, such as patterns of proximity and communication within an organization and the information-flow patterns associated with high-performing individuals and groups (4).

We can also learn about the "macro" social networks of society (5) and how they evolve over time. Phone companies hold records of calling patterns among their customers spanning years, and e-commerce portals such as Google and Yahoo hold records of instant-messaging exchanges among their users. Can such data paint a comprehensive picture of societal communication patterns? Which aspects of these interactions affect economic productivity or public health? In any case, tracking human movement has become easy (6). Mobile phones provide a way to track people's movements and physical proximity on a large scale and over long periods (7). These data may yield useful epidemiological insights, for example into how a pathogen such as an influenza virus spreads through a population via physical contact.

The Internet offers an entirely different route to understanding what people are saying and how they are connected (8). For example, in the political season just past, tracking how arguments, rumors, political positions, and other cues spread through the blogosphere (9), together with individuals' web-surfing behavior (10), made it quite clear what individual voters actually cared about. Virtual worlds, which by their nature keep a complete record of every player's actions, offer still more research possibilities, including experiments that would be impossible or unacceptable in the real world (11). Similarly, online social-networking sites offer a unique way to understand how a person's position in the network affects everything from their tastes to their moods and health (12). And natural language processing continues to expand the capacity to organize and analyze the enormous volumes of text from the Internet and other sources (13).

In short, computational social science is leveraging the capacity to collect and analyze data at an unprecedented breadth, depth, and scale. Yet obstacles that are not easily overcome stand in the way of this progress. Existing methods cannot handle billions of constantly changing interactions and locations across entire populations. Existing social network theory, for example, was largely built on one-time "snapshots" of data from a few dozen people; what can it tell us about the interrelationships among the locations, commercial transactions, and daily communications of millions of people? These emerging streams of interaction data can provide quantitatively new perspectives on collective human behavior, but our current research frameworks are not equipped to handle them.

Data from the blogosphere. Shown is the link structure among communities of political blogs (from 2004). Red denotes conservative blogs and blue liberal blogs; orange links go from liberal to conservative, and purple links from conservative to liberal. The size of each blog reflects the number of other blogs that link to it.

There are also institutional obstacles to computational social science. In terms of approach, the questions probed by physics and biology are more amenable to observation and intervention: in the process of discovery, quarks and cells neither mind having their secrets revealed nor resist having their environments altered. In terms of infrastructure, the gap between social science and computational social science is much larger than that between biology and computational biology, mainly because computational social science requires distributed monitoring, permission for tracking, and encoding, for which social science has almost no resources; indeed, in both physical distance and administrative organization, sociology departments sit farther from engineering and computing departments than most other sciences do.

Perhaps the thorniest challenge is how to keep data accessible while properly protecting personal privacy. Much of the data is proprietary (mobile phone records, commercial transaction data, and so on). The uproar caused when AOL released the "anonymized" search records of many of its customers highlights the potential risks to individuals and companies when private firms share private data (14). Robust models of collaboration and data sharing between industry and academia are needed to promote research while protecting individual privacy and offering companies protection. More generally, handling privacy well is essential. A recent US National Research Council report on geographic information systems specifically discussed the routine removal of identifying personal characteristics and the careful anonymization of data (15). Last year the US National Institutes of Health and the Wellcome Trust abruptly removed online access to a number of genomic databases (16). Those data appeared to be anonymized, reporting only the aggregate frequencies of certain genetic markers. Research showed, however, that it is statistically possible to re-identify individuals by drawing on the full data of all individuals in the database (17).

Because a single egregious privacy violation could trigger rules and statutes that strangle the nascent field of computational social science, self-regulating regimes of procedures, technologies, and rules must be established to reduce such risks and preserve the potential for research. As a cornerstone of such self-regulation, US Institutional Review Boards (IRBs) must strengthen their technical knowledge in order to understand the potential for intrusion and individual harm, because the new possibilities can no longer be judged by their current paradigms of harm. Many IRB staff are ill-equipped to assess the likelihood that complex data could be de-anonymized. Moreover, IRBs may need to consider whether a dedicated institution focused on data security is necessary; at present, existing data circulate among many organizations whose understanding of, and capacity for, data security vary widely. Researchers themselves must develop technologies that protect privacy while preserving data essential for research; such systems, in turn, may also help industry protect customer privacy and data security (18).

Finally, the development of computational social science, like that of other emerging interdisciplinary fields (such as sustainability science), requires new ways of training scholars. Tenure committees and editorial boards need to understand and reward efforts to publish across disciplines. Initially, computational social science needs to be the joint work of social scientists and computer scientists. In the long run, the question will be whether academia should train computational social scientists, or teams of computationally literate social scientists and socially literate computer scientists. The emergence of cognitive science offers a good model for computational social science: spanning fields including biology, philosophy, and computer science, it attracted substantial investment to create a common field and has contributed greatly to the public good over the past generation. We believe computational social science has similar potential and deserves similar investment.

Link: http://computational-communication.com/计算社会科学/what-watts-says/

Question 1: Network Science and Big Data

What are your impressions of the way that network science has gone? A lot of it increasingly (since small worlds especially) focuses on the shape of the network, rather than the attributes of nodes; do you think that's the right way forward? Is there anything big missing from network sociology, or a direction that you think it should be going in? Will "small data" networks be drowned out by big data?

How is network science going?

If you look back at my original paper with Strogatz it has “collective dynamics” right there in the title—it was always the relationship between structure and behavior that we thought was interesting, not structure for its own sake.

We also didn’t intend for the “small world” model that we proposed to be interpreted as a realistic model of network structure; rather we were trying to make a conceptual point that even subtle changes in micro-structure could have dramatic effects on macro-structure and hence possibly also macro-behavior.

I’ve also come to believe that modeling exercises that are unconstrained by data have a tendency to gravitate toward phenomena that are mathematically interesting, which is no guarantee of empirical relevance. Fortunately I think that in recent years we’ve seen more emphasis on studying both network structure and collective behavior empirically.(最重要的是数学推导与经验知识的一致性)

What questions truly need big data?

For some questions, such as when we are interested in rare events or estimating tiny effect sizes, it is indeed necessary to have a very large number of observations; in some of our recent work on diffusion, for example, it turns out that a billion observations is not excessive. But for other questions, the scale of the data is much less important than its type or quality. Sometimes it matters that a sample is unbiased or representative; other times it is important to have proper randomization in order to infer causality; and other times still it is important simply that you have instrumented the outcome variable of interest. Regardless, the point is whether the data are relevant to the question you're asking, not how big or small they are.

Question 2: Why is there a sociological imagination but no sociological equivalent of economic thinking?

Economists have been pretty successful at clearly articulating a set of core concepts that have spread out into the broader world and form the basis for economic thinking: supply and demand; markets; externalities; people respond to incentives; perverse incentives; sunk costs; exogenous shocks; etc. Since your early work brought together core sociological concepts (namely social influence and the Matthew effect): i) Do you think sociology should try to reorganize itself around core, "Soc 101" concepts that every introductory class would cover? We often talk about the sociological imagination, but that is much less clear than economic thinking.

The social reality is too complex

This is a tough one. One of the things I've always liked about sociology is its embrace of multiple viewpoints, both in terms of theory and methodology. Personally I think social reality is too complex to be adequately accounted for by any single theoretical framework—a point that Merton made very eloquently many years ago in his article on middle-range theories. Unfortunately I don't think his argument was properly understood at the time (e.g. by rational choice theorists), and I don't think it is even now.

The advantage of economics

Perhaps that’s because simple universal frameworks are institutionally powerful even when they’re scientifically questionable. And that’s why it’s a tough question: because I think that one reason why economics is so much more influential than sociology in government, in the media, and in society, is precisely because economists can articulate a fairly coherent worldview that they can all (by and large) get behind, whereas sociologists can’t really agree on anything. Economists are therefore in a much better position to offer answers to questions that people care about, whereas sociologists tend to point out all the ways in which the question is more difficult than the questioner realized. Even if the sociologist’s response better reflects our true understanding of the world, it’s no surprise that most people would prefer to listen to the economist.

Sociologists should try to solve some nontrivial but solvable problems in order to build consensus.

That said, I wouldn't advocate sociology trying to develop a single set of core concepts just to compete with economics. Rather I would propose that sociologists identify a small set of nontrivial real-world problems that we believe we can actually solve, or at least make some meaningful progress towards solving, and then demonstrate that progress. Identifying nontrivial but solvable social problems isn't easy, nor do I think that solving problems is the only measure of progress in a discipline. So I certainly wouldn't advocate that everyone drop what they're doing to work on these problems, or even try to agree on what they should be. But I do think that being able to point to a set of problems that sociologists have arguably "solved" would greatly enhance our collective reputation and help us to attract more students.

ii) If you could pick a handful of sociological concepts and then have everyone outside of sociology learn them, and they’d be as familiar as the economic examples listed above, what would they be?

A book called Everything is Obvious: Once you Know the Answer

I wrote a book a few years ago called Everything is Obvious: Once you Know the Answer about the failures of commonsense reasoning and how we systematically ignore them. I think the contents of that book is pretty close to the list of concepts I would like everyone to understand, including: the nature of common sense itself; the difference between rational choice and behavioral conceptions of individual decision making; cumulative advantage and intrinsic unpredictability; the fallacy of the representative individual; the perils of ex-post explanations and dangers of “overfitting” to known outcomes; the consequences of overfitting for predictions about the future; and the implications of all of these problems for practical matters of predicting success, rewarding performance, deciding what is fair, and even what is knowable. I wouldn’t claim that these concepts constitute a core of knowledge comparable to core concepts in economics, nor do I think it would help students directly solve real-world problems of the kind I just advocated for, but I do think it would teach students some epistemic modesty and might eventually lead to more intelligent public discourse about these problems. I’m not sure the book has accomplished any of that, but that’s why I wrote it.

Question 4: Try to compare sociology PhDs with industry demands

You made a transition from academia to the private sector. One way people have suggested improving the sociology PhD job market is to make work in the private sector a clearer option from the beginning. What do you think about sociology PhDs and the private sector? How do you think sociology PhDs could or should go about this? What skills should they develop? How should they present themselves to companies?

Big data need theoretical knowledge to build valuable insights.

It’s true that companies are increasingly excited about extracting value from data, which has made data science a very in-demand skill set. I also think that companies are starting to appreciate that truly valuable insight requires more than just good computational and statistical chops—some degree of theoretical knowledge is also required in order to ask the right questions, define the right metrics, and avoid basic errors of sampling bias, causal inference etc. This latter trend is much earlier in its life cycle than the former, but I think as companies learn more about the complexities and compromises associated with “big data” they will increasingly demand data scientists with social scientific training.

The bright side and the downsides of sociology PhDs going into industry.

So on the bright side I think that there is real potential for sociology PhDs to find intellectually rewarding work in industry. The downside is that in order to realize any value from their sociological training they also need a level of technical skill that is well beyond what students can expect to learn in the vast majority of sociology PhD programs. In our postdoc hiring we are starting to see a handful of strong candidates with sociology PhDs—up from zero just a couple of years ago—so that's encouraging. But I suspect that these students mostly figured it out on their own or took it upon themselves to find the relevant courses in other departments. Which is fine, and if I were a current sociology PhD that's what I would do, but I think it would be better for the field to provide a more systematic level of training.

Question 5: The differences between writing for AJS and for Nature or Science

You’ve been very successful publishing in journals such as AJS, while targeting broader audiences through high-impact journals such as Nature and Science. How is writing for an AJS audience different from writing for Nature and Science? Where would you send your manuscript if was rejected by Science? Do you think more sociologists should be looking to publish beyond our traditional journals in order to reach a broader community of scholars?

Different readers need different writing strategies.

Writing for AJS is completely different from writing for Nature and Science in almost every sense: length, style, treatment of related literature, acceptable methodology, conception of theory, presentation of results…everything. It’s also different from writing for computer science conference proceedings and physics journals, and all of those different outlets are also different from one another. I also occasionally write magazine articles, op-eds and trade books, and those are also all different in their own way. Learning to write for different disciplinary outlets and in different styles is time consuming and sometimes frustrating—because different groups of readers care about such different things. But I think it’s an effort that sociologists should make.

Try to speak to computer scientists in their language

The fact is that with very few exceptions researchers in other disciplines don't read sociology journal articles, and when they do they find them incredibly long and tedious. For example, all that effort that we devote to situating our work is completely lost on most computer scientists, so when they get to the results section they wonder why it was necessary to write 40 pages in order to explain one table of regression coefficients. Given that CS is a much bigger and more powerful discipline than sociology, if we want to have an impact on them or convince them that we are worth taking seriously, we will have to speak to them in their language and probably in their own publication venues.

A single high impact paper is worth many low impact papers.

On the other hand a single high impact paper is worth many low impact papers, so from a career perspective it’s not necessarily a waste of time to devote a year or two to getting something into a top journal. I do often wish that we could find a more efficient way to publish our research without compromising quality, and in that regard online-only, open-access journals like PLoS One and Sociological Science have some appealing properties. But the reality is that we live in a highly competitive world where attention is scarce; so my fear is that if we stopped using A-journal publications as a differentiator, the likely substitute (relentless self-promotion on social media anyone?) might be even worse.

Question 6: How to choose between academia and industry

You spent several years at the Columbia Sociology Department. During your time there you mentored several prominent junior scholars including Baldassari and Salganik. How was your experience being an academic sociologist and why did you decide to leave for industry? Will you consider returning to Academia?

Why leave Columbia for Yahoo! Research?

I really loved my time at Columbia but around 2006 it started to dawn on me that, whether it liked it or not, sociology was going to become a computational science, much as biology had become a computational science in the early 1990s. All around us social data were exploding in volume and variety, from email to social networking services to online experiments of the kind I did with Matt (Salganik). It also occurred to me, however, that sociologists weren't well equipped to handle this transition and that if we were going to make rapid progress we would need the computer scientists to help, and possibly psychologists and economists as well. Columbia is now pretty open to interdisciplinary collaborations of this sort, and their data science institute is a great example of that openness, but at the time it was very hard to see how it would work within the confines of traditional academic departments.

Suffering from Academia

I was also having difficulty recruiting grad students with rigorous mathematical and computational backgrounds (as you noted there were some like Matt and Delia and also Gueorgi Kossinets, but they were really the exceptions), and raising funding to support the whole thing. Towards the end I felt like I was spending all my time writing grant proposals or sitting in meetings and almost no time doing actual research. So when Prabhakar Raghavan called me from Yahoo! to ask if I would come and help them set up a social science research unit it was very tempting. Even then I wasn’t sure I would do it, and certainly didn’t expect to do it for long, but it really worked out wonderfully and now I’ve been at Yahoo! and Microsoft Research for longer than I was at Columbia.

Doing Research more purely at Microsoft or Yahoo!

Perhaps surprisingly, I think the biggest difference between my experience at Columbia compared with Microsoft (or at Yahoo!) is that I now spend much more time doing and thinking about research. The other big difference is that, in contrast with most university faculty, I am surrounded (literally—we all sit in cubicles in an open plan office) by researchers from different disciplinary backgrounds including psychology, economics, physics, and computer science. One of my colleagues once observed that university departments comprise lots of people with similar training interested in different problems, whereas research labs like ours comprise lots of people with different training interested in the same problems. I think that's roughly true, and it completely changes the nature of how we work, which is highly collaborative, interdisciplinary, and very problem oriented. That is not to say that we only do "applied" research—we do some of that but we also do a lot of basic science and publish all our work in all the same venues as our colleagues in universities. Rather what it means is that we are more concerned with the relevance of our work to real-world problems and less concerned about what particular disciplinary tradition it fits into.

Would I ever consider returning to academia? I don’t know. I’m very happy at Microsoft right now: I work with fantastic colleagues, we get amazing PhD student interns every summer, and we work on a wide variety of extremely interesting problems. It’s been a great experience and every day I’m grateful to have the job that I have. So although I wouldn’t rule out returning to academia one day I’m not in any hurry to leave.

Question 7: Common sense and its importance for sociology

You recently wrote an article on common sense and its importance for sociology. What was the intuition for it?

Sociologists conflate causal explanations with explanations that “make sense” of outcomes they have observed

As I mentioned, I recently wrote a book about how people rely on common sense more than they realize, and in so doing end up persuading themselves that they understand much more about the world than they actually do. In the course of writing the book, it occurred to me that sociologists make many of the same mistakes that other people do. Just like other people, that is, sociologists conflate causal explanations with explanations that "make sense" of outcomes they have observed, unconsciously substitute representative individuals for collectives, overfit their explanations to past data, and fail to check their predictions. I didn't belabor this point in the book because, as I mentioned earlier, I wanted it to be an advertisement for sociological thinking not a critique of it. Nevertheless I thought the implication was pretty clear, so I was disappointed that some of my colleagues who liked the book's appeal to non-sociologists seemed to think it had nothing to say to them. I decided that if I wanted them to get the message I would have to sharpen it up a lot, and also make it a bit more constructive; so that's what I tried to do in that paper.

Question 8: Changes needed for Sociology department and some tips for current sociology graduates

If you were in charge of a Sociology department and could implement any change you’d like, what specific changes would you introduce to its graduate training program? Is there something that current sociology graduate students aren’t doing that they should be doing?

Changes needed for Sociology department

As I mentioned earlier, I think that a data science sequence (e.g. data acquisition, cleaning and management; basic concepts and programming languages for parallel computing; advanced statistics, including methods of causal inference; some basic machine learning; design and construction of web-based experiments) would be super useful for sociology graduate students, and would make them both better social scientists and also much more attractive to prospective employers. There are already a handful of courses of this sort being trialed in various places, including Stanford, Columbia, and Princeton, and sociology departments could work with their colleagues in other departments to pull together a reasonable sequence from existing pieces. It would take some effort and probably resources, but I don’t think it’s unfeasible.

Some tips for current sociology graduates

In the meantime, as I mentioned earlier: if I were a current sociology grad student, I would be busy taking courses in computer science and statistics to augment my sociology training. I would also look around for any groups doing computational social science and ask to join them.

It is an adventurous thing to join a new interdisciplinary field like computational social science.

The downside of new, interdisciplinary fields is that nobody really knows what is involved or what the standards are, so you have to be prepared to take some risks and also to feel out of your depth much of the time. The upside is that it can be incredibly stimulating, and there is the possibility of doing something genuinely new. I think computational social science is in that phase now, so it’s a great time for ambitious and creative students to dive in and see what they can do.

Sociological theory, if it is to advance significantly, must proceed on these interconnected planes: 1. by developing special theories from which to derive hypotheses that can be empirically investigated and 2. by evolving a progressively more general conceptual scheme that is adequate to consolidate groups of special theories.

— Robert K. Merton, Social Theory and Social Structure

Why are so many classic works of sociology so dense and obscure to read? Because early sociology inherited its approach from philosophy and was therefore overly abstract. In response, Merton proposed middle-range theory (Middle-Range Theory).

Middle-range theory is intended, in principle, to guide empirical inquiry in sociology. It lies between general theories of social systems and detailed descriptions of particulars. ... Middle-range theory deals with delimited ranges of social phenomena.

…what might be called theories of the middle range: theories intermediate to the minor working hypotheses evolved in abundance during the day-by-day routine of research, and the all-inclusive speculations comprising a master conceptual scheme.

— Robert K. Merton, Social Theory and Social Structure

Our major task today is to develop special theories applicable to limited conceptual ranges — theories, for example, of deviant behavior, the unanticipated consequences of purposive action, social perception, reference groups, social control, the interdependence of social institutions — rather than to seek the total conceptual structure that is adequate to derive these and other theories of the middle range.

— Robert K. Merton

Hu Yiqing, in his article "Empirical Communication Research: From Middle-Range Theory to the Philosophy of Money", argues:

In my view, if things really were as Merton envisioned, an empirical research paradigm guided by middle-range theory would of course have important epistemological significance for communication research. ... But Merton's middle-range theory is not a flawless theoretical position; it rests on two debatable premises: first, that sociology, like physics, can ultimately arrive at a complete theoretical system through the accumulation of theories; and second, that sociological theory will keep progressing in some definite direction. Neither premise is reliable.

The following excerpts are from Social Theory and Social Structure, Robert K. Merton, Chapter 2, "On Sociological Theories of the Middle Range".

Total systems of sociological theory

Compared with the search for an all-embracing unified theory, the search for theories of the middle range demands of the sociologist a commitment to something quite different.

Early sociology grew up in an intellectual climate devoted to building highly comprehensive scientific systems. ... Nearly every pioneer of sociology tried to construct his own system, and each system claimed to be the true sociology.

The building of most theoretical systems in the social sciences should be seen, from a developmental standpoint, as differing from the building of doctrines and comparable systems in the natural sciences. In the natural sciences, theories and descriptive systems are perfected as scientists' knowledge and experience grow. In the social sciences, systems often spring from a single mind; if they attract attention, they get discussed, but collaborative efforts by many hands to improve them continuously are rare. — L. J. Henderson (biochemist and amateur sociologist)

Much of what is now called sociological theory consists of general orientations toward data: it suggests the types of variables a theory must somehow account for, rather than offering clearly formulated, verifiable statements of the relationships between specific variables.

Einstein, though himself stubbornly and solitarily devoted to this quest, conceded that:

The greater part of physical research is devoted to developing the various branches of physics, in each of which the aim is the theoretical understanding of more or less restricted fields of experience, and in each of which the laws and concepts remain as closely as possible in touch with experience.

Sociologists who hope for a solid, general sociological system in our time, or soon after, would do well to ponder these remarks. If centuries of accumulated theoretical generalization in the natural sciences have not produced an all-encompassing theoretical system, it is all the more telling that sociology, a science only beginning to accumulate empirically grounded theoretical generalizations of limited scope, had best moderate its craving for such a system.

Utilitarian pressures for total systems of sociology

My emphasis on the gap between the practical problems sociologists face and the knowledge and skills they have accumulated does not mean, of course, that sociologists should not seek to develop increasingly comprehensive theory, or that they should not work on research directly relevant to urgent practical problems. Still less does it mean that sociologists should devote themselves to trivial practical problems. Basic research and theoretical generalization of every kind are closely, or at least potentially, related to particular practical problems. But it is important to restore a proper sense of historical perspective: the urgency or magnitude of a practical social problem does not guarantee its timely solution. At any given moment, scientists can solve only certain problems and remain helpless before others.

Total systems of theory and theories of the middle range

From this standpoint there is reason to believe that sociology will advance insofar as its major (though not exclusive) concern is with developing theories of the middle range, and that it will stagnate if its principal energies are devoted to developing all-embracing sociological systems.

If sociological theory is to advance significantly, it must develop on these two interconnected planes: (1) by creating special theories from which hypotheses amenable to empirical investigation can be derived; and (2) by evolving, gradually rather than all at once, a more general conceptual scheme capable of consolidating groups of specific theories.

To concentrate entirely on special theories is to risk emerging with specific hypotheses that explain limited aspects of social behavior, organization, and change but that remain mutually inconsistent.

To concentrate entirely on a master conceptual scheme from which all subsidiary theories are to be derived is to risk steering twentieth-century sociology toward the grand philosophical systems of the past, with all their varied splendor of architecture and their poverty of insight. The sociological theorist who confines himself exclusively to the search for a highly abstract total system produces ideas that, like fashionable ornaments, are hollow and tiresome.

If, as in sociology's early days, every charismatic sociologist tries to create his own total theoretical system, the road to a genuinely comprehensive sociological theory will be blocked. ... The development of sociological theory shows that an emphasis on the middle range is necessary.

Much of what is now called sociological theory consists of general orientations toward data, suggesting the types of variables that theories must account for rather than clearly formulated, verifiable statements of relationships between specific variables. We have many concepts but few confirmed theories; many points of view but few theorems; many "approaches" but few arrivals. Perhaps a further shift in emphasis would be very much to the good.

Our discussion of middle-range theory in sociology is intended to clarify a policy question facing all sociological theory: to which kind of inquiry should more of our collective capacities and energies be devoted, the study of verifiable theories of the middle range or the study of an all-inclusive conceptual scheme? I believe (though beliefs are easily misunderstood) that theories of the middle range hold the greatest promise, provided that work on them is accompanied by a pervasive concern with consolidating special theories into less abstract concepts and mutually consistent propositions.

Little systems can also accomplish something; they too have their day and cease to be. — the provisional view of my elder brothers and of Tennyson

More comprehensive theory develops through the consolidation of theories of the middle range, rather than emerging suddenly and wholesale from the work of individual theorists.

As I have noted elsewhere, this orientation is neither new nor alien; it is deeply rooted in intellectual history. Bacon, more than any of his predecessors, stressed the importance of "middle axioms" in science.

Only the middle axioms are true, solid, and living; on them the affairs and fortunes of men depend, and only by way of them can one finally ascend to the truly most general axioms, which are then no longer abstract but bounded by these middle axioms. Plato put it well in his Theaetetus: "Particulars are infinite, and the higher generalities give no sufficient direction." The pith of all sciences, which makes the expert differ from the layman, lies in the middle propositions, which in every particular branch of knowledge are drawn from tradition and experience. — Bacon

... We are able to form limited theories, to predict general tendencies and ordinary causal regularities, which might be largely untrue if extended to all of humanity, yet possess a certain truth if confined to particular countries ... By narrowing the field of observation, by restricting ourselves to certain types of community and stating the facts as they are, it becomes possible to enlarge the scope of political theory. In this way we can increase the number of genuine political axioms derived from facts and, at the same time, make those axioms fuller, more vivid, and more firmly grounded. In contrast to merely empty generalizations, they resemble Bacon's middle axioms: generalized statements of fact, yet close enough to practice to serve as guides in the affairs of life. — George Cornewall Lewis

Durkheim's monograph Suicide is perhaps the classic example of the use and development of middle-range theory.

With the introduction of the middle range theory program, he advocated that sociologists should concentrate on measurable aspects of social reality that can be studied as separate social phenomena, rather than attempting to explain the entire social world. He saw both the middle-range theory approach and middle-range theories themselves as temporary: when they matured, as natural sciences already had, the body of middle range theories would become a system of universal laws; but, until that time, social sciences should avoid trying to create a universal theory.

– Mjøset, Lars. 1999. “Understanding of Theory in the Social Sciences.” ARENA working papers.

Network Diversity and Economic Development

Nathan Eagle, Michael Macy, Rob Claxton

The goal is to study the impact of a country's social network structure on its development (the social impact of a national network structure).

Previous research suggests that economic opportunities tend to arise from social ties outside one's local circle of close acquaintances:

heterogeneous social ties may generate these opportunities from a range of diverse contacts (1, 2)

  1. M. Newman, SIAM Rev. 45, 167 (2003).
  2. S. Page, The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies (Princeton Univ. Press, Princeton, NJ, 2007).

Because such data were previously hard to obtain, the relationship between network diversity and a population's economic well-being could not be studied quantitatively.

Previous studies at the individual level have found that individuals benefit from having social ties that bridge between communities. These benefits include:

  • access to jobs and promotions (5–13)
  • greater job mobility (14, 15)
  • higher salaries (9, 16, 17)
  • opportunities for entrepreneurship (18, 19)
  • increased power in negotiations (20, 21).

Although these studies suggest the possibility that the individual-level benefits of having a diverse social network may scale to the population level, the relation between network structure and community economic development has never been directly tested (22).

Why it matters:

As policy-makers struggle to revive ailing economies, understanding this relation between network structure and economic development may provide insights into social alternatives to traditional stimulus policies.

Data:

The communication network data were collected during the month of August 2005 in the UK. The data contain more than 90% of the mobile phones and greater than 99% of the residential and business landlines in the country.

The resulting network has $65 \times 10^6$ nodes, $368 \times 10^6$ reciprocated social ties, a mean geodesic distance (minimum number of direct or indirect edges connecting two nodes) of 9.4, an average degree of 10.1 network neighbors, and a giant component (the largest connected subgraph) containing 99.5% of all nodes (23).
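
As a side note, here is a minimal sketch (not the authors' code) of how such summary statistics could be computed with networkx from a much smaller edge list of reciprocated ties. The file name and format are assumptions, and the mean geodesic distance is estimated by sampling node pairs, since an exact all-pairs computation is infeasible at the full 65-million-node scale.

import random
import networkx as nx

# Assumed format: one "u,v" reciprocated tie per line (hypothetical file).
G = nx.read_edgelist("reciprocated_ties.csv", delimiter=",")

n_nodes = G.number_of_nodes()
mean_degree = 2 * G.number_of_edges() / n_nodes

# Giant component: the largest connected subgraph, as a fraction of all nodes.
giant = max(nx.connected_components(G), key=len)
giant_fraction = len(giant) / n_nodes

# Estimate the mean geodesic distance by sampling pairs inside the giant component.
Gg = G.subgraph(giant)
nodes = list(Gg)
sample = [nx.shortest_path_length(Gg, random.choice(nodes), random.choice(nodes))
          for _ in range(1000)]
mean_geodesic = sum(sample) / len(sample)

print(n_nodes, mean_degree, giant_fraction, mean_geodesic)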

Introducing the Index of Multiple Deprivation (IMD)

Although the nature of this communication data limits causal inference, we were able to test the hypothesized correspondence between social network structure and economic development using the 2004 UK government's Index of Multiple Deprivation (IMD), a composite measure of relative prosperity of 32,482 communities encompassing the entire country (24), based on income, employment, education, health, crime, housing, and the environmental quality of each region (25). Each residential landline number was associated with the IMD rank of the exchange in which it was located, as shown in Fig. 1.

Computing topological diversity with Shannon entropy

We developed two new metrics to capture the social and spatial diversity of communication ties within an individual's social network. We quantify topological diversity as a function of the Shannon entropy.

High diversity scores imply that an individual splits her time more evenly among social ties and between different regions.

Diversity was constructed as a composite of Shannon entropy and Burt's measure of structural holes, using principal component analysis (PCA). A fractional polynomial was fit to the data.
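
A rough sketch (my own, not the authors' code) of how such a composite could be assembled: compute the normalized entropy per user, use Burt's constraint (as implemented in networkx) as the structural-holes ingredient, and take the first principal component as the composite diversity score. The toy `calls` list, the sign convention on constraint, and the use of scikit-learn are all illustrative assumptions.

import math
from collections import defaultdict

import networkx as nx
import numpy as np
from sklearn.decomposition import PCA

# Toy data: (caller, callee, call volume).
calls = [("a", "b", 10.0), ("a", "c", 3.0), ("b", "c", 5.0), ("c", "d", 2.0)]

volume = defaultdict(dict)
G = nx.Graph()
for i, j, v in calls:
    volume[i][j] = volume[i].get(j, 0) + v
    volume[j][i] = volume[j].get(i, 0) + v
    G.add_edge(i, j, weight=v)

def entropy_diversity(vols):
    """Normalized Shannon entropy of a user's call-volume shares."""
    total = sum(vols.values())
    ps = [v / total for v in vols.values()]
    if len(ps) < 2:
        return 0.0
    return -sum(p * math.log(p) for p in ps) / math.log(len(ps))

# Burt's constraint is high when a network is closed; negate it so that larger
# values mean more structural holes (more "openness").
constraint = nx.constraint(G, weight="weight")

users = sorted(G.nodes())
X = np.array([[entropy_diversity(volume[u]), -constraint[u]] for u in users])

# First principal component as the composite diversity score.
score = PCA(n_components=1).fit_transform(X).ravel()
print(dict(zip(users, score)))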

新媒体环境下传统媒体的变革——基于受众研究的视角来反思内容生产和发布

陈志聪 MF1611002

本学期,江苏广电集团节目研发与受众研究中心的王宁老师为我们做了《受众研究对电视节目创新及制作的意义》讲座,本文即从受众研究这个视角来谈一谈传统媒体在新媒体时代,如何在内容生产和发布上进行应对和变革。

在媒介研究中,受众研究格外重要。因为所有媒介研究的研究意义“要看这些分析最终能不能在媒介对读者和受众的影响性质方面有所阐发”(奥利弗.博伊德-巴雷特,2004)。经过近一个世纪的研究,至今已创建和发展出了不下于几十种的受众理论及媒介效果理论(McQuail,1997)。

王宁老师在讲座中提到:“英国广播公司(BBC)有超过80年的受众研究……2010-2011年省级卫视研发的242档新节目,到2012年仅存48档,淘汰率高达80%。为了更为全面的把握受众的特征和习惯,内容生产者必须非常重视受众研究。”新闻与传播学科中的受众是指新闻传播活动中新闻信息的接受者,也叫受传者,通常称为“受众”。它既包括大规模新闻传播中的群体——如报刊的读者,广播媒介的听众和电视媒介的观众,也包括小范围新闻传播中的个体——如一场讲座的参与者和对话者。在知识信息化社会,新的传播媒介——互联网络上的信息接收者也是新闻传播受众研究的内容。

受众对传统媒体的重要地位

受众的概念具有相对性。在人类社会信息传播系统中,传播过程并不是单向的流程,而是具有双向流动的特点。传播者发出信息之后,受众总要做出或积极,或消极,或接受,或拒绝等程度不同的反应。这些反应,就是反馈。传播者根据反馈信息来进一步调整自己的传播行为,以便取得预期的效果。很显然,传播者和受众总是处在信息互动之中,二者之间,角色是可以互换的。传播过程中的受众,在反馈信息过程中变成了传播者,原来的传播者角色与对方互换了。因此,受众只是相对传播者而存在的概念,二者相互依存,同时产生,并常常互换角色。在信息传播的链条中,这种角色的互换是信息沟通、回环流动、信息增殖以及逐步达到最佳效果的基础。

在新闻传播者和新闻受众这一对矛盾中,谁是矛盾的主要方面?历史发展的客观规律表明,受众是矛盾的主要方面和决定性因素,在传播中占有十分突出的地位,发挥十分重要的作用。 受众是新闻传播活动中信息流动的目的地,是新闻传播链条的一个重要环节,也是新闻传播过程得以存在的前提和条件。受众又是新闻效果的评判者。没有受众的反应和评价,就不能真正了解新闻传播媒介的效能和效率。受众实际上决定了新闻传播活动的基本方向。

任何新事物的问世,首先在于社会对这一新事物的需要。从新闻传播的历史发展来看也是如此。人生活在一个瞬息万变的世界里,为了生存和发展,人就要不断了解外界信息,减少对外界变动状态的不确定性。这样交换信息的传播就产生了。最早的传播是人际传播,范围和影响都很有限。随着社会的发展和生产力的不断提高,原有的人际信息交流远远不能满足社会大多数人的需要,在这种情况下,专门为满足社会大众新闻信息需求的行业便应运而生。随着科学技术的进步和传播手段的不断更新,新闻传播事业便成为人类社会不可缺少的专门发布新闻信息的机构。简而言之,新闻传播行为是人们解决生存问题的客观需要,新闻传播事业是满足社会大规模新闻信息需求的产物。近代新闻传播业产生以后,新闻媒介及时、全面地反映客观世界,使公众社会意识形成,由直接亲历的经验世界转变为新闻媒介塑造的观念世界来完成。可见,在客观世界信息和社会大众之间,新闻传播者扮演了中介和桥梁的角色,是传播过程中的中间环节,但始终不能起决定性作用。因为其传播行为至少要受到新闻事实的客观性和受众需要与接收心理规律的制约,超越了这两个因素的制约,就会导致虚假新闻和被受众拒绝接收的局面。

与此相反,受众虽然处在传播过程中信宿的位置,具有决定的作用,在传播诸要素中,是最自由的,不受限制的一个。受众是新闻信息的接收者。新闻传播过程首先是指在一个特定的关系结构内,新闻信息由传播者向受传者的流动过程。也就是说,受众在新闻传播过程中是作为接收新闻信息的一方存在,是新闻信息传播的目的地和终点。作为人类有意识、有组织的社会行为,新闻传播者的主观目的、信息内容的传播价值、传播过程的保真度等,最终要通过受众的接受和理解才能实现。要说明的是,受众在接受新闻信息时的具体表现并不都是一致的,换言之,受众在接收程度上会产生很大的差异。

新媒体时代受众的新特征

在新闻传播活动中,传播者要使信息传播有效地进行,就必须了解受众,明确受众的基本特点。由无数个体汇集而成的受众,其群体特征是怎样的呢?就受众在空间上分布、存在的态势看,受众的特点主要表现在以下五个方面:(1)人数众多。一家全国性的报纸,发行量常在百万份以上;一家全国性的电视网,其观众可达数亿人。(2)分布广泛。新闻传播的受众在空间或者说地域位置上的分布十分广泛,难以限定,可以遍及地球上的任何角落。(3)成分复杂。受众是由不同民族、不同国家、不同阶级、阶层、不同地位、不同职业的社会成员所构成的集合体,其成分异常复杂,其个性特征更是千姿百态。(4)流动变化。外在的流动指人们在地域上的流动,内在的流动指由社会成员职业、地位的改变而导致的社层流动。(5)彼此隐匿。受众彼此分散居住,有的相距千里;这些人行踪不定,互不相识,素无关系,相互匿名;他们之间既无接受协议,也无有关接受准则;无共同感受,也无共同意识;在接受中,他们对别人无统制功能,别人对他们也无可奈何。

进入新媒体时代后,随着新闻传播飞速发展的年代,传播技术日趋发达,互联网和卫星电视等新兴传媒将更加普及。在这样的背景下,新闻传播的受众至少会有以下几个新特点,它们将对新闻传媒的吸引力产生很大影响。

信息来源很多,选择余地很大,接受时的主动性很强。这对于人的发展,社会的进步,是十分有利的。而新闻媒介只有按照新闻规律和传播规律进行运作,才能在众多的传播者中脱颖而出,获得受众的选择。

在大量的信息面前,受众又需要在选择上获得帮助。媒介的信誉,品牌的质量,对受众的选择会有很大的影响,有的品牌甚至成为受众的依赖。

独立思考和判断的能力加强,个人的独立自主性也会相应增强。受众眼界开阔,文化程度高,独立思考、判断的能力和习惯增强,盲从度会大大降低。因此,受众对媒介质量的要求会更高,且不易被欺瞒和愚弄。媒介必须时时处处、方方面面都保持其真正的高质量,才能不断地吸引住受众。

对媒介的需求增强,需求的个性化程度提高。由于受众的经济能力强,闲暇时间多,文化程度高,因而媒介消费能力大为增强。又由于社会联系多,生活、工作等各种活动的社会化程度提高,人们对信息、娱乐、生活指导等需求以及自我表达的需求既多又强,受众和广告主对媒介的依赖程度也会不断提高,因而媒介消费欲望又会相应增强。许多受众会不满足于只接触一种日报、几个广播电视频道和少数网站。而且,广告也会大量增加,并具有更强的针对性。所有这些,都将给媒介带来新的机遇,同时又为媒介市场更加细分创造了条件。

此外,受众需要更优质的服务,包括符合它们个性化需求的传播。这使媒介小众化趋势更强。因此,媒介必须增强自己的特色,进一步提高对特定受众的针对性。

西方传统受众研究理论回顾

受众研究是当代新闻传播学的重要内容。它主要是从理论上研究受众参与新闻传播活动的规律,研究受众寻求信息和发出信息(反馈)的规律,并深入探讨受众的心理活动规律,寻找知道新闻传播活动的理论依据。关于西方的受众理论,美国传播学家梅尔文·德弗勒在《大众传播理论》(1975)一书中将其归纳为以下几种。

个人差异论。它由霍夫兰于1946年最先提出,并由德弗勒在1970年作出某些修正而形成的。这个理论以心理学“刺激—反应”模式为基础,认为“受众成员心理或认识结构上的个人差异是影响他们对媒介的注意力以及对媒介所讨论的问题和事物所采取的行为的关键因素。” 美国学者雷蒙德·鲍尔也提出相应的观点,他在一篇题为《顽固的受传者》文章中说:“在可以获得的大量内容中,受传者中的每个成员特别注意选择那些同他的兴趣有关、同他的立场一致、同他的信仰吻合、并且支持他的价值观念的信息。他对这些信息的反应受到他的心理构成的制约……传播媒介的效果在广大受传者中远不是一样的,而是千差万别的,这就是因为每个人在心理结构上是千差万别的。”

社会分类论,又称为社会范畴论,这一理论是对个人差异论的修正与扩展,是在个人差异论基础上将造成差异的原因进一步扩展到了社会的进化和变化之中。美国学者约翰·赖利与蒂尔达·怀特·赖利在论文《大众传播社会系统》(1959)中揭示了基本群体在传播过程中扮演的角色,从而首先进入这一理论的研究领域。他认为,”社会分化产生独特的行为方式。换句话说,相同社会类型成员身份的人常常行为类似”。 这些相同身份的人常常会对同样的信息感兴趣,并做出相近的反应,采取不同于其他社会类型的行为方式。传播者可按照性别、年龄、地区、民族、职业、工资收入、宗教信仰、文化程度等方面的异同,将受众分为不同的社会类型,然后有针对性地采写、设计、制作、传播讯息,使不同的讯息流向不同的受众,是能增强传播媒介的吸引力、提高大众传播的效果的。

社会关系论最早来自于美国学者拉扎斯菲尔德、贝雷尔森等于1940年所进行的关于新闻传播的报道对改变人们在总统竞选中投票态度的作用的调查。调查结果表明,新闻传播的作用并不象人们想象的那么明显、直接。许多人并不是根据新闻传播媒介提供的信息来决定自己的态度的,他们接受的是来自自己家庭成员、朋友以及其他人的信息。这当中呈现出了“二级传播流程”,那些活跃的“意见领袖”在其中发挥了重要作用。这种理论认为,受众都有自己特定的生活圈。这种生活圈可能是有纲领、有领导、有组织的团体,也可能是无纲领、无组织、临时性的非正式的团体,还可能只是邻里、家庭等群体关系。因此,新闻媒介的效果既非一致的、强大的,也非直接的;个人间的相互影响极大地限制和约束着传播效果。

新媒体环境下内容生产与内容发布的新策略

根据上述理论,新闻传播媒介在设计劝服性运动之前,就不能简单地从传播者的社会立场和态度入手,应先弄清讯息所针对的各种受众的特点,了解、利用来自受众的各种先天性经验、态度和后天性立场,然后依照传播对象的兴趣、需要、价值观、信念等,从尊重受众的个人态度的角度出发挑选与之相应的讯息进行因人而异的传播。受众是新闻生产的参与者。新闻生产实际上是受众共同合作参与的结果,受众同样是新闻的生产者。因为受众同样是一个充满能动性的主体。传播者在制作、发出新闻之前及其之后,都要认真分析并理解受众对象,以便双方达到默契。受众对新闻的接收事实上也是对新闻的加工过程。受众往往根据自己以往的经验以及需要,对新闻作出鉴别和理解。在这个过程中,受众期望满足自己的新闻欲,传播者企盼对受众产生某种影响,传、受双方信息共享的过程就是他们借助于共享而相互影响的过程。传播、分享、互相影响,然后再传播、分享,再产生互相影响,新闻就是在这样不断地相互作用中被生产、被传播。

从这样一个受众接受新闻信息的内在机制看,受众又具有四个特点:(1)自在性。受众不是某种臆想的东西,不是理论上的假设,而是十分具体的、有血有肉、有思想、有情感的客观现实。(2)自主性。受众不是新闻传播者的俘虏,可以任意驱赶;也不是新闻传播者的敌手,专门揭短拆台。(3)自述性。受众对新闻作品内容的感知与认识不全由记者给定,尽管记者对受众的选择与解释的自由度远不如作家的所给,但面对新闻信息每一位接受者仍然都会作出属于他自己的解释与阐述,并据此进行再传播。(4)归属性。受众虽是自发的、未经组织的人群,但这并不意味着他们无类可归、心无所系,恰恰相反,他们总是自觉或不自觉地将自己划归、登记在某一特定的接受群体之列。而新闻媒介也同样有意地把不同的新闻信息分类集中传播给不同的接受群体,如《中国青年报》、《足球》等报刊,和广播电视系统的经济台、文艺台等都是。

在新媒体时代,上述策略依然成立,但内容发布后的传播越来越依赖于社会化媒体,例如微博、微信等,也就是所谓的“社会关系论”。在社交媒体上的传播往往要借助受众的“转发行为”。通过上述分析,我们可以整理出受众转发内容的几种自发心理归因,从而为内容生产者提出一些相应的新策略。

  • 自我人设。这是李普曼的那些理论在互联网时代的最好的重现。每个使用微信的人,尤其是有影响力的人,他们使用微信时一定是在塑造他的自我形象。比如,通过转发内容来体现自己的娱乐精神,亦或是吐槽、自黑、反鸡汤等等。所以从内容生产者的角度而言,至少不能抵消用户的自我人设。
  • 情绪宣泄。被转发的(具有较强传播能力的)内容往往要有一定的争议性。这个道理很简单,没有争议,就没有关注。因此,内容生产者不应害怕内容有争议,关键是自己掌控尺度。很多情况下,还要学会利用争议,甚至创造争议。
  • 省时省力。有一种图方便的、或者说有点“偷懒”的内容生产方式,那就是做干货和做盘点。做盘点也是也是一种原创内容,但它是更巧妙的原创。很多人其实都会有这样的心理:“我的朋友不会乱转发东西的,尤其是那些我在乎的朋友。”通常,盘点类文章标题里的数字很关键。这样的内容生产,只要能引起人们的动作,就是好的,收藏也好,转发也好。
  • 有利可图。奖品、线下活动等等。在内容发布的同时,可以策划一些或大或小的活动相结合,例如定期举办大活动,例行举行线上小活动等等。

面对 20 世纪汹涌的媒介技术革新浪潮,麻省理工学院比较媒介研究中心主任亨利·詹金斯在其名著《文本盗猎者》中提出了“参与式文化(participatory culutre)”这一概念,他还颇具建设性地指出:当今不断发展的媒介技术使普通公民也能参与到媒介内容的存档、评论、挪用、转换和再传播中来,媒介消费者通过对媒介内容的积极参与而一跃成为了媒介生产者。尼葛洛庞帝在《数字化生存》中说:“后信息时代的根本特征是真正的个人化”,也就是说,“个人不再被埋没在普遍性中,或作为人口统计学中的一个子集,网络空间的发展所寻求的是给普通人以表达自己需 要和希望的机会”。 在新媒体环境下,新技术为受众参与提供了可能,新观念激发了受众参与的热情。面对新媒体浪潮带来的这一切,我们必须以推动新媒介素养教育作为首要的应对之策。对于新媒体,我们要努力发掘其中的有利因素,同时也要能以辩证的思维来识别与化解新媒体洪流中的种种陷阱和危机,唯其如此才能真正把握时代赋予我们的机遇。

The goal is to reproduce the figures in the nature06958 paper, replicating Barabási's work, and to learn how others do research and write papers.

The first concrete task is to extract every user's records from a batch of mobile-phone data. Use Python, or learn Hadoop? I was stuck on this. A question worth noting: how large does a dataset need to be before Hadoop is worth it? Probably at least on the TB scale; for GB-scale data, optimized Python code plus shell tools should be enough.

Processing columns with awk is enough to shrink the dataset: on January 5 I first used awk to extract four fields from a whole month of data, reducing it from over 400 GB to about 54 GB.

Today the plan is to extract all users' trajectory data from the 54 GB dataset.

Started reading the 3rd chunk at 18:26; it was read within 1 minute and processed within about 5 minutes. Then the file writing began: completing it took between half an hour and an hour, and roughly 99,714 files were written out.

A lesson: minimize the number of file reads and writes! It wastes far too much time. Splitting the 54 GB into 10 passes simply does not work; since there is enough memory, it is worth trying to read everything in at once, process it, and do a single round of file I/O.

Reading the data in one go, as a single chunk, overflowed memory: even 128 GB of RAM could not hold it, because a huge dict has to be maintained along the way. Two chunks were still not enough, so I switched to reading 10 GB at a time, in 5 chunks.

While processing the first chunk, the file-writing progress reached 100% around 8 o'clock, but the screen just sat there; top showed the program was still computing, though it was unclear on what. My guess is that the data were being flushed from the buffer to disk: when the write statement reports 100% it may only mean the buffer has been filled, and the real wait is flushing that buffer to the disk file, which felt endless. Only at 8:35 did the computation of the second chunk finally start. The dict produced by each chunk is roughly 40-50 GB, and writing it to disk takes more than half an hour.
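
A sketch of the chunked approach described above (hypothetical file name, separator, and column handling; this is not the original script): stream the large file in chunks with pandas, accumulate per-user records in a single dict, and write everything out once at the end instead of writing inside the loop.

import csv
from collections import defaultdict

import pandas as pd

trajectories = defaultdict(list)  # user_id -> list of (start_time, longitude, latitude)

cols = ["user_id", "longtitude", "latitude", "start_time"]
# Header lines mixed into the concatenated file, if any, would need to be filtered first.
for chunk in pd.read_csv("user_loc_time.txt", sep=" ", names=cols, chunksize=5_000_000):
    for row in chunk.itertuples(index=False):
        trajectories[row.user_id].append((row.start_time, row.longtitude, row.latitude))

# Single write at the end.
with open("trajectories.csv", "w", newline="") as f:
    w = csv.writer(f)
    for uid, points in trajectories.items():
        for t, lon, lat in sorted(points):
            w.writerow([uid, t, lon, lat])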

A lesson: when crunching data in the future, avoid disk reads and writes whenever possible and keep the computation in memory. Record the results to files in the smallest, most economical form, otherwise the waiting time is far too long.

For this particular task, if nothing else works, just don't write to disk at all and find a way to compute the result directly in memory. Rather than spending time figuring out how to optimize the code, or how to learn Hadoop, Spark, or parallel Python, it is better to spend the time looking for alternatives: how to reduce the amount of computation, how to use better tools. If all else fails, could some of the work be done with GraphLab, using it for the most critical computation steps?

Hadoop, Spark, and parallel Python programming are too time-consuming to pick up right now. They are worth studying when the opportunity is right, to improve my skills and toolbox, but for now the focus must stay on what matters most. Writing code is not the goal, it is the means; the goal is to produce good research and write a paper. Ideas matter more than tools: being able to realize your own ideas is what counts, whatever the method or tool. Keeping this clear is important.

从财新数据可视化实验室到数据工场——听黄志敏讲座有感,兼谈我对数据新闻的一些思考

陈志聪 MF1611002

2016年9月24日,数据工场创始人、财新网前CTO、财新数据可视化实验室创始人黄志敏作客南京大学新闻传播学院,为南大学子带来了《从数据新闻到数据工场》的知识讲座。黄志敏从2011年入职财新传媒之后一直忙于“重新搭建研发团队,推动新媒体转型”,从2013年6月开始投入数据新闻领域,三年内大小奖项拿了十一个,代表作品之一是财新传媒于2014年7月29日推出的数据新闻《周永康的人与财》,该作品中英文版分获亚洲新闻奖、以及2014腾讯传媒大奖“年度数据新闻”、国际新闻设计协会(SND)多媒体设计优秀奖等。

数据新闻之现状

黄志敏在其讲座的开篇便设一问:“媒体过得好吗?”给出的思考是,这要看如何定义媒体,像纽约时报这样的是媒体,他们可能活得不那么好,那如果像Google、Facebook这样的公司也算媒体呢?他们可能就活得很好了。这是为什么呢?因为这是一个数据为王的时代。如何用数据推动新闻业发展?这是做数据新闻最需要考虑的问题。

数据新闻是什么?这个问题似乎并没有一个确切的回答。从概念来看,它最初来自于国外的计算机辅助报道,后来演变为数据驱动的新闻,简称为数据新闻。从业界来看,国内数据新闻始于国外精确新闻的传入,发端于2009年。2012年前后,国内门户网站才开始纷纷进行数据新闻的初步实践。黄志敏在一次采访中曾说,“目前,我国数据新闻的发展仍处于起步阶段,但是声势比较大。除了财新,还有澎湃、腾讯、人民、新华、网易和搜狐等都在做数据新闻,团队较多;另外,已经有十个左右的高校在开展数据新闻教学,即将开设这方面课程的高校大概有四、五十个。”

数据新闻之教学

黄志敏在讲座中,将数据新闻的生产过程大致分工为4种角色:数据分析师、记者编辑、美术设计师、程序员,事实上,这4种角色也正是财新数据可视化实验室的主要组成部分。当然,黄志敏强调,“4种角色”不一定要有“4个人”,实际上对于一个新兴的团队而言,人越多,沟通和管理的成本反而越高,这是一个不可忽视的大问题,因而最理想的情况是一个人能够兼顾4个方面,会的越多越好。他指出,目前国内的新闻学子中几乎没有这样的全才,但在国外留学的中国学生中有,例如组建数据新闻网的那一小伙留学生。问题出在哪里?黄志敏认为,这主要是因为国内的学生容易以“文科生”自居,从而自己给自己设限。理科生学文科似乎比较容易上手,主要靠多读书和多写作,比如王小波。而文科生学理科的难处主要在于“需要一个更长久的‘实验’或‘训练’的过程。”

黄志敏非常提倡文科生学习数据新闻,他认为一个好的记者往往不是来自于单一的新闻科班教育,而是会有如法律、金融、气象等其他专业的背景。新闻学院的学生不仅要会采写,还要学一些计算机方面的技能,文科生应该学编程吗?黄志敏给出的答案是:“每个学生都应该学习编程,根本没有必要区分文理科……至少要对设计和开发有些基本概念;你不需要写代码,但你得知道代码是怎样写出来的。”文科生学习编程难吗?黄志敏认为,千万不能给自己设限,不要给自己开脱和心理暗示。虽然难,但没有难于上青天。而且作为一个数据新闻的记者,并不需要多么优秀的编程能力。没有什么bug是解决不了的。要不断扩充自己的能力,你必须要满足公司的需求,公司才会需要你。

关于新闻行业的现状,黄志敏提醒我们,不要慌张,不要受纷繁复杂的外界的影响。如果要学习数据新闻,那将注定是一条比较长久而艰辛的道路,所以应该思考一下人生的终极目标,我到底想要什么?学会取舍,才不会把时间浪费在不必要的事情上。“要考虑好在这个方向上,你要得到什么,是想挣钱呢?还是要名份呢?还是要影响力呢?想清楚想要什么,然后去做就好。”

本学期我们南大新闻传播学院给大二的本科生开设了数据新闻课程,我有幸成为这门课的助教。通过和同学们交流,我发现大家对于课程中的大量编程作业大多还是感到步履维艰,但也有同学非常感兴趣,希望找个方向深入下去,这是这门课结束时让人感到欣慰的地方。值得一提的是,有几个信息管理学院跨院系选修的同学,因为之前上过编程基础课(C++程序设计),听这门课就显得轻松许多,兴趣也浓厚得多。这再次验证了黄志敏老师在讲座中提到的点,其实新传学子学习数据新闻最大的困扰是对编程的畏惧和心理上对自己我的束缚,如果能够通过一门先修课帮助同学们破除对编程的畏惧,甚至激发部分同学的兴趣,那必将对数据新闻的教学产生极大的帮助。

对数据新闻的再反思

虽然数据新闻似乎是这几年传媒业的热词,但从某种程度上讲,我觉得目前的数据新闻正在走向衰落,数据新闻刚兴起时像财新《周永康的人与财》那种优秀的数据新闻作品越来越少,越来越多的数据新闻变成了简单的统计图表和花哨的手绘图,形式大于内容,数据的地位正在被可视化代替,这是一个危险的信号,而可视化的手段又极为有限,缺乏创造力。很多时候,用Excel对几十行的小数据集进行简单的排序、统计后绘制几张折线图、饼图,辅以一些非常浅显的分析,便被冠名以数据新闻的“名号”,这显然不应当是数据新闻的本色。更为致命的是,即便是一篇付出了极大的人力物力做出来的数据新闻报道,可能也只是带来非常可怜的阅读量的提升,这很容易极大地打击数据新闻生产者的积极性。

黄志敏于今年8月份离开了财新传媒,创办了数据工场。离职财新给了黄志敏跳出媒介局限的机会,他承认“帮助媒体发展”比“在媒体内部工作”更能激发自己的兴趣。用他的话来说,“我从来不把自己定义为媒体人,也没有什么媒体情结,但我有互联网情结,我觉得自己做什么一定会与互联网有关。……从来不认为自己会被固定在某个方向上。有新事物就不断学习,如果不能干到最好,也要做到不比别人差。……数据新闻是数据可视化的一个子集,数据可视化又是整个数据领域的子集。所以,我很清楚自己是要选定数据领域,而不仅仅是数据新闻这块,我必须去拓展一个新的平台。”

从黄志敏的离职中,我认为他非常隐晦地表达了他对于数据新闻现状的担忧。如黄志敏自己所言,数据新闻对于进入新媒体领域是非常好的切入点,因为它见效快,看得见,摸得着。这是它的好处。但经过这几年的实践,我们发现,数据新闻所面临的问题也是不可忽视的。比如许多有价值的数据获取困难、数据来源有限、数据新闻人才不足、数据可视化形式趋于单一等。更重要的是,数据新闻并没有明确的盈利模式,但投入却相对较大。目前来看单纯靠数据新闻作品换取广告成为盈利模式并不现实,优质的数据新闻作品所能实现的仅为扩大影响力和为客户端导入流量。

数据新闻应当跳脱新闻的框架

有一篇报道这样写道:“(数据新闻)归根到底还是新闻的一种报道形式,并没有脱离新闻单独存在。对于适合使用数据来报道的新闻我们就采用这种形式,但不必盲目跟从。同时,它和VR、无人机报道一样,是全新的新闻报道方式,是媒体融合创新可以突破的领域,是值得新闻工作者进行深入探索的。”这段话中的理性立场值得肯定,但真正引人深思的是,数据新闻作为新闻的一种形式,是否从一开始就注定了它的结局?数据新闻一定要被限制在新闻的框架内吗?

在一次采访中,黄志敏说:“坦率地讲,数据新闻不会解决目前媒体的困境,现在媒体的困境是内容变不了钱的问题。它只是增加一个手段,让你在数据丰富的世界里分析和呈现的手段,增加你产品的竞争力。”我认为这句话说得非常有见地,很多人对数据新闻抱有极大的期待,认为数据新闻可以成为媒体的一条极好的出路,这可能反映了很多媒体人在遭遇行业危机时产生的非常急功近利的心理,见到什么新事物寄希望于它,企图走上一条能够立马解决困境的道路。然而事情哪有这么简单呢?

为了看清楚这个问题,我们应该重新反思一下,究竟怎样的数据新闻是好的数据新闻?我认为好的数据新闻应该是通过深入挖掘数据之间的关联发现全新的视角甚至规律,从而对新闻事件给出一种全新的解读,在解读的过程中,加以精确而独到的可视化展示。这样的生产过程其实跳脱出了日常的新闻知识生产的框架,因为这样的数据新闻很难像日常的新闻那样实现批量生产,某种程度上,它更像是一项精美的社会科学研究,目的是从数据中揭示深层的机理和规律,这样的数据新闻才真正用好了数据,实现了自己的价值。而为什么如今大量的数据新闻越来越把关注点放在可视化上,忽略了对数据本身的仔细打磨、研究和解读,我觉得很大程度上受到新闻生产过程的束缚,为了实现所谓的“时效性”而丢失了更可贵的“深刻性”。

再谈数据新闻的核心价值

黄志敏还说过,“数据新闻也是新闻,新闻怎么赚钱,数据新闻就怎么赚钱。现在数据新闻不赚钱,但做数据新闻的这项技术和能力可以挣钱。”我非常同意这句话,它一针见血地指出了数据新闻“吃力不讨好”的原因,因为大家都把数据新闻看作是新闻大框架下的一种,而如今依靠内容生产的新闻都不赚钱,数据新闻又怎么赚钱呢?但是数据新闻并非一文不值了,它真正的核心价值在于其背后的技术和能力。事实上,数据新闻团队最核心的价值和竞争力在于,它是一支在数据的敏感性、分析能力和呈现形式方面具有诸多优势的媒体团队,这一点若能加以提高和利用,并和公司进行商业合作,或许是能解决盈利问题的。正如黄志敏在某次论坛分享时说的一样:在短期之内你不要高估了数据新闻的威力。但对数据新闻长期的威力你也不要小看它。

抛开新闻的框架不谈,数据新闻本质上是一种能力——从一整套数据分析处理和可视化的工具,到对数据价值的敏感,再到对社会现实独到的认知和理解(这是数据新闻工作者与一般的数据挖掘工程师之间最大的差别),甚至培养出一种强大的洞察力。

黄志敏说数据工场的核心是数据服务,包含数据的挖掘,分析,展示,分享和交易这五部分。“把内容,数据,艺术,技术相结合,与企业、媒体和高校对接。”从这一点上来讲,黄志敏的离职显然就是为了跳脱出新闻媒体原来的框架,用他自己的话说,“跳出媒介的局限,才是对媒介更好的帮助。”针对不同的机构,具体服务形式有所不同,比如帮助企业运用数据,帮助媒体发展数据新闻和实现新媒体转型,帮助高校培养数据人才。

本文的主标题为“从财新数据可视化实验室到数据工场”,表面上看,这只是黄志敏个人的一次职业变动,但我认为这实际上这表明了黄志敏对数据新闻未来的判断。财新数据可视化实验室创立初期通过较为领先的可视化技术大放异彩,如今已不比当年。我相信黄志敏一定认识到数据新闻必然从原先重视“可视化”走向重视“数据”本身,才会做出这样的选择。数据新闻的根本价值一定不在于可视化,而在于对数据的理解与阐释,它与工程师们所从事的机器学习与数据挖掘工作最大的区别在于,它的目的是实现对社会、对人性本身的关怀,借助的是社会科学的领域知识(domain knowledge)和人文素养,它比算法更有温度,比工程师更具情怀,这是它独有的、无可替代的核心价值。数据新闻只有打开思路,真正从数据出发去探索其特有的社会价值,才可能有旺盛的生命力,也才可能实现曲线救国,为正在困境中的媒体找到一条极好的出路。

本文为2016-2017学年第一学期《名记者进课堂》课程大作业。

Task: extract each user's movement coordinates from one month of data

Find all users in the 20131224 data:

awk -F "\"*,\"*" '{print $1}' gprs_bh_20131224.del | sort -su | wc -l

Write them to a file all_users.txt: 72,909 users in total. For these 72,909 users, extract their movement trajectories from the 20131201-20131231 data.

# Define main: for every user id in all_users.txt, collect that user's records
# from every daily file, then prepend a CSV header line to the per-user file.
main(){
    while read -r name
    do
        echo "$name"
        for file in ./flowdata/*.del
        do
            echo "$file"
            # Note: grep matches the id anywhere in the line; anchoring the pattern
            # to the first field would be safer if the id can appear in other columns.
            grep "$name" "$file" >> "./users/user_${name}.dat"
        done
        # add csv title
        title='user_id,access_mode_id,logic_area_name,lac,ci,longtitude,latitude,busi_name,busi_type_name,app_name,app_type_name,start_time,up_pack,down_pack,up_flow,down_flow,site_name,site_channel,cont_app_id,cont_classify_id,cont_type_id,acce_url'
        sed -i "1i$title" "./users/user_${name}.dat"
    done < all_users.txt
}
# Invoke main
main

Running from Saturday morning to Monday morning produced only 97 users, still a long way from 72,909. The data obtained so far can certainly be used for a trial computation, but to really extract all users' trajectories it may be necessary to learn Hadoop MapReduce programming (a simpler single-pass alternative is sketched below).
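
A single-pass alternative (my sketch, not the original workflow): instead of scanning every daily file once per user, read each file once, buffer lines per user id, and flush the buffers to per-user files periodically. Paths, the flush threshold, and the assumption that the user id is the first comma-separated field mirror the shell script above but are otherwise illustrative.

import os
from collections import defaultdict
from glob import glob

OUT_DIR = "users"
FLUSH_EVERY = 5_000_000  # lines buffered before flushing; tune to available memory

os.makedirs(OUT_DIR, exist_ok=True)
buffers = defaultdict(list)
buffered = 0

def flush(buffers):
    # Append each user's buffered lines to that user's file, then clear the buffers.
    for uid, lines in buffers.items():
        with open(os.path.join(OUT_DIR, f"user_{uid}.dat"), "a") as out:
            out.writelines(lines)
    buffers.clear()

for path in sorted(glob("./flowdata/*.del")):
    with open(path) as f:
        for line in f:
            uid = line.split(",", 1)[0].strip('"')
            buffers[uid].append(line)
            buffered += 1
            if buffered >= FLUSH_EVERY:
                flush(buffers)
                buffered = 0

flush(buffers)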

# Print columns 1, 6, 7 and 12 of the first 5 lines
head -n 5 gprs_bh_20131201.del | awk -F "\"*,\"*" '{print $1,$6,$7,$12}'
output:
user_id longtitude latitude start_time
31000773 116.345373 39.998725 20131201002143
41365834 116.107333 39.911472 20131201002332
43100000155674 116.173231 40.201851 20131201002404
31000773 116.345373 39.998725 20131201002116
# Alternatively (field quotes are kept in the output)
head -n 5 gprs_bh_20131201.del | awk -F "," '{print $1,$6,$7,$12}'
output:
user_id longtitude latitude start_time
31000773 "116.345373" "39.998725" "20131201002143"
41365834 "116.107333" "39.911472" "20131201002332"
43100000155674 "116.173231" "40.201851" "20131201002404"
31000773 "116.345373" "39.998725" "20131201002116"
# Print columns 1, 6, 7 and 12 of all files
awk -F "\"*,\"*" '{print $1,$6,$7,$12}' *.del >> user_loc_time.txt
# Print all values in the first column, deduplicated
awk -F "\"*,\"*" '{print $1}' gprs_bh_20131224.del | sort -su
# > truncates the file and writes; >> appends
# Find the lines containing user '102940566'
cat gprs_bh_20131224.del | grep '102940566' > user_102940566.dat
# List all zero-size files in the current directory
find . -size 0
# Show the PID of the process using port 9999
lsof -i :9999
# Write the first n lines of a large file to a smaller file
head -n 10000 gprs_bh_20131224.del > gprs_bh_20131224.del.cut
# Download the Fudan social media corpora
wget -cq http://sma.fudan.edu.cn/corpus/Twitter_Data/Twitter.Corpus.7z &
wget -cq http://sma.fudan.edu.cn/corpus/SMP_2015_Weibo_Data/SMP-Weibo.7z &
ps -aux | grep wget
ll *.7z
# chown changes the user that owns a file or directory: chown USER FILE_OR_DIR
# (e.g. change the owner of the qq directory under home to user qq)
chown user /home/dir
# chgrp changes the group of a file or directory: chgrp GROUP FILE_OR_DIR
# (e.g. change the group of the qq directory under home to group qq)
chgrp usergrp /home/dir
# Insert 123456789 as the first line of the file
sed -i '1i123456789' yourfile
# Insert specified content as the first line of a file:
:~$ sed -i "1i#! /bin/sh -" a
# After running, '#! /bin/sh -' is inserted as the first line of file a
# Insert specified content at line n of a file (replace n with the line number):
:~$ sed -i 'niecho "haha"' a
# After running, echo "haha" is inserted at line n of file a
# Append specified content as the last line of a file
# (the sed approach also works); the usual way:
:~$ echo "haha" >> a
# After running, haha is appended as the last line of file a

Check the size of every file in the current directory

du -sh * | sort -h > files.txt

Check disk usage

df -hl

Segmenting Chinese text with jieba

jieba ("Jieba", literally "to stutter") aims to be the best Python module for Chinese word segmentation. It supports three main segmentation modes:

  • Full mode: scans out every word that can possibly be formed from the sentence; very fast, but it cannot resolve ambiguity;
  • Accurate mode: tries to cut the sentence into the most precise segmentation; suitable for text analysis;
  • Search-engine mode: on top of the accurate mode, long words are split again to improve recall; suitable for search-engine tokenization.
import jieba
seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list)) # full mode
seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list)) # accurate mode
seg_list = jieba.cut("他来到了网易杭研大厦") # accurate mode is the default
print(", ".join(seg_list))
seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造") # search engine mode
print(", ".join(seg_list))

Reference: https://github.com/fxsjy/jieba/

Simple text analysis with SnowNLP

SnowNLP is a Python library (package) for conveniently processing Chinese text. It was inspired by TextBlob; because most existing natural language processing libraries target English, this library was written to make Chinese handling convenient. Unlike TextBlob, it does not use NLTK (an English text-processing package): all algorithms are implemented from scratch, and some trained dictionaries are bundled.

Note: the library works on unicode text, so inputs need to be decoded to unicode before use.

Example:

from snownlp import SnowNLP
# Analyze the sentence '这个东西真心很赞' ("this thing is really great") with SnowNLP
s = SnowNLP(u'这个东西真心很赞')
# Word segmentation: extract all the words in the sentence
s.words # [u'这个', u'东西', u'真心',
# u'很', u'赞']
s.tags # [(u'这个', u'r'), (u'东西', u'n'),
# (u'真心', u'd'), (u'很', u'd'),
# (u'赞', u'Vg')]
# Sentiment of the sentence, a value in (0, 1); the closer to 1, the more likely it is positive
s.sentiments # 0.9769663402895832, probability of being positive
# Convert the words of the sentence to pinyin
s.pinyin # [u'zhe', u'ge', u'dong', u'xi',
# u'zhen', u'xin', u'hen', u'zan']
# Convert a traditional-Chinese sentence to simplified Chinese
s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')
s.han # u'「繁体字」「繁体中文」的叫法
# 在台湾亦很常见。'
# A longer piece of text (as unicode)
text = u'''
自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。
它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。
自然语言处理是一门融语言学、计算机科学、数学于一体的科学。
因此,这一领域的研究将涉及自然语言,即人们日常使用的语言,
所以它与语言学的研究有着密切的联系,但又有重要的区别。
自然语言处理并不是一般地研究自然语言,
而在于研制能有效地实现自然语言通信的计算机系统,
特别是其中的软件系统。因而它是计算机科学的一部分。
'''
s = SnowNLP(text)
# Extract keywords from the text
s.keywords(3) # [u'语言', u'自然', u'计算机']
# Summarize the text (scores the importance of each sentence, using the TextRank algorithm)
s.summary(3) # [u'因而它是计算机科学的一部分',
# u'自然语言处理是一门融语言学、计算机科学、
# 数学于一体的科学',
# u'自然语言处理是计算机科学领域与人工智能
# 领域中的一个重要方向']
# Given a set of tokenized documents
s = SnowNLP([[u'这篇', u'文章'],
[u'那篇', u'论文'],
[u'这个']])
# Term frequency (TF)
s.tf
# Inverse document frequency (IDF)
s.idf
# Similarity to another list of words
s.sim([u'文章'])# [0.3756070762985226, 0, 0]

Reference: https://github.com/isnowfy/snownlp

Understanding individual human mobility patterns

Marta C. Gonzalez, Cesar A. Hidalgo & Albert-Laszlo Barabasi

Original article: http://www.nature.com/nature/journal/v453/n7196/abs/nature06958.html

Abstract

We find that, in contrast with the random trajectories predicted by the prevailing Levy flight and random walk models, human trajectories show a high degree of temporal and spatial regularity, each individual being characterized by a time-independent characteristic travel distance and a significant probability to return to a few highly frequented locations. After correcting for differences in travel distances and the inherent anisotropy of each trajectory, the individual travel patterns collapse into a single spatial probability distribution, indicating that, despite the diversity of their travel history, humans follow simple reproducible patterns.

Background

Earlier work found that the movement of animals (such as albatrosses and monkeys) can be approximated by a Levy flight: a random walk whose step sizes, denoted $\Delta r$, follow a power-law distribution
$$ P(\Delta r) \sim \Delta r^{-(1+\beta)} $$

Earlier studies that tracked bank notes found that human travel can be modeled as a continuous-time random walk with fat-tailed distributions of displacements and waiting times.

However, tracking a bank note really reflects the composite movement of the two or more people who carried the note between two successive reports of its location. It therefore cannot tell us whether the observed distribution reflects the motion of individual users or some previously unknown convolution between population-level heterogeneity and individual human trajectories. A mobile phone, by contrast, is carried by the same person throughout his or her daily routine, and thus provides the best proxy for capturing individual human trajectories.

Data

D1 Dataset

European mobile-phone data: the date, time, and coordinates of the tower routing each call or text message, for 6 million users over 6 months.

This dataset was collected by a European mobile phone carrier for billing and operational purposes. It contains the date, time and coordinates of the phone tower routing the communication for each phone call and text message sent or received by 6 million costumers. The dataset summarizes 6 months of activity.

Because only the position of the routing tower is recorded, jumps longer than 1,000 km cannot be captured in the dataset.

Each tower serves an area of approximately 3 km2. Due to tower coverage limitations driven by geographical constraints and national frontiers no jumps exceeding 1, 000 km can be observed in the dataset.

Jumps that took users outside the continental territory were not considered.

We removed all jumps that took users outside the continental territory.

D2 Dataset

Some mobile services require the user's location to be recorded at regular intervals, independent of whether he or she makes a call or sends a text.

Some services provided by the mobile phone carrier, like pollen and traffic forecasts, rely on the approximate knowledge of customer’s location at all times of the day. For customers that signed up for location dependent services, the date, time and the closest tower coordinates are recorded on a regular basis, independent of their phone usage.

From such records we obtained 206 users with 10,613 recorded positions, each user having his or her coordinates recorded every two hours over an entire week.

We were provided such records for 1, 000 users, among which we selected the group of users whose coordinates were recorded at every two hours during an entire week, resulting in 206 users for which we have 10, 613 recorded positions.

Because these users were selected on the basis of an action (signing up for the service), the sample may in principle be biased, but so far we have not detected any such bias.

as these users were selected based on their actions (signed up to the service), in principle the sample cannot be considered unbiased, but we have not detected any particular bias for this data set.

Observation

The distribution of $\Delta r$

We measured the distance between user’s positions at consecutive calls, noted as $\Delta r$

$$ P(\Delta r) = (\Delta r + \Delta r_0)^{-\beta}exp(-\Delta r/\kappa) (1)$$

Equation (1) suggests that human motion follows a truncated Levy flight
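
For concreteness, here is a small sketch (my own, not the paper's code) of how the jump sizes $\Delta r$ between a user's consecutively recorded tower positions could be measured with the haversine great-circle distance; the empirical $P(\Delta r)$ is then the distribution of these values. The `events` list and its format are hypothetical.

from math import asin, cos, radians, sin, sqrt

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in kilometres."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def jump_sizes(events):
    """Delta r between consecutive records of one user: events = (timestamp, lon, lat)."""
    events = sorted(events)  # sort by timestamp
    return [haversine_km(lon1, lat1, lon2, lat2)
            for (_, lon1, lat1), (_, lon2, lat2) in zip(events, events[1:])]

events = [(20131201002116, 116.345373, 39.998725),
          (20131201002143, 116.345373, 39.998725),
          (20131201002332, 116.107333, 39.911472)]
print(jump_sizes(events))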

However, the observed shape of $P(\Delta r)$ could be explained by three distinct hypotheses:

  • first, each individual follows a Levy trajectory with jump size distribution given by equation (1) (hypothesis A);
  • second, the observed distribution captures a population-based heterogeneity, corresponding to the inherent differences between individuals (hypothesis B);
  • third, a population-based heterogeneity coexists with individual Levy trajectories (hypothesis C); hence, equation (1) represents a convolution of hypotheses A and B.

The distribution of $r_g$

To distinguish between hypotheses A, B and C, we calculated the radius of gyration for each user (see Supplementary Information IV)

$$P(r_g)=(r_g+r_g^0)^{-\beta_r}exp(-r_g/\kappa)$$

Question: how is $r_g$ computed, and how should the radius of gyration be interpreted?
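
As a working answer (my own note, based on the standard definition used in this line of work; the paper's Supplementary Information should be checked for the exact form): the radius of gyration of user $a$ after time $t$ is the root-mean-square distance of the recorded positions from the trajectory's centre of mass,

$$ r_g^a(t) = \sqrt{\frac{1}{n_c^a(t)} \sum_{i=1}^{n_c^a(t)} \left( \vec{r}_i^{\,a} - \vec{r}_{cm}^{\,a} \right)^2 }, \qquad \vec{r}_{cm}^{\,a} = \frac{1}{n_c^a(t)} \sum_{i=1}^{n_c^a(t)} \vec{r}_i^{\,a} $$

where $\vec{r}_i^{\,a}$ are the positions recorded for user $a$ up to time $t$ and $n_c^a(t)$ is their number. Intuitively, $r_g$ measures the characteristic linear size of the territory an individual covers.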

Relationship Between ${r_g}$ and t

The longer we observe a user, the higher the chance that she/he will travel to areas not visited before.

We measured the time dependence of the radius of gyration for users whose gyration radius would be considered

  • small ($r_g(T)$ <= 3 km),
  • medium (20 < $r_g(T)$ <= 30 km) or
  • large ($r_g(T)$ > 100 km)

at the end of our observation period (T = 6 months). The results indicate that the time dependence of the average radius of gyration of mobile phone users is better approximated by a logarithmic increase: not only a manifestly slower dependence than the one predicted by a power law, but also one that may appear similar to a saturation process (Fig. 2a and Supplementary Fig. 4).

Relationship Between $P(\Delta r | r_g)$ and $\Delta r$

Figure 2b shows that users with a small $r_g$ usually move within a small area, whereas those with a large $r_g$ tend to combine many small jumps with a few much larger jump sizes.

As the inset of Fig. 2b shows, users with small $r_g$ travel mostly over small distances, whereas those with large $r_g$ tend to display a combination of many small and a few larger jump sizes.

After rescaling the axes, the different curves collapse onto a single curve, which suggests that a single jump-size distribution may characterize all users.

Once we rescaled the distributions with $r_g$ (Fig. 2b), we found that the data collapsed into a single curve, suggesting that a single jump size distribution characterizes all users.

$$P(\Delta r | r_g) \sim r_g^{-\alpha} F(\Delta r / r_g)$$

where $\alpha \approx 1.2 \pm 0.1$ and $F(x)$ is an $r_g$-independent function with asymptotic behaviour, that is, $F(x) \sim x^{-a}$ for $x < 1$, while $F(x)$ rapidly decreases for $x \gg 1$ (i.e., $F$ is power-law-like for $x < 1$ and falls off quickly for $x > 1$).

In other words, an individual's travel can be approximated by a Levy flight only up to a distance set by $r_g$; beyond that the trajectory is bounded, so the large jumps that give Levy flights their distinctive, anomalous character are statistically absent.

Therefore, the travel patterns of individual users may be approximated by a Levy flight up to a distance characterized by $r_g$. Most important, however, is the fact that the individual trajectories are bounded beyond $r_g$; thus, large displacements, which are the source of the distinct and anomalous nature of Levy flights, are statistically absent.

This shows that the observed distribution of $\Delta r$ is in fact a convolution of the statistics of individual trajectories, $P(\Delta r | r_g)$, with the population heterogeneity $P(r_g)$; that is, hypothesis C holds.

This indicates that the observed jump size distribution $P(\Delta r)$ is in fact the convolution between the statistics of individual trajectories $P(\Delta r | r_g)$ and the population heterogeneity $P(r_g)$, consistent with hypothesis C.

Computing the probability that a person returns, after t hours, to the location where they were first observed

To uncover the mechanism stabilizing $r_g$, we measured the return probability for each individual, $F_{pt}(t)$ (first passage time probability), defined as the probability that a user returns to the position where he/she was first observed after t hours (Fig. 2c).

People's return probability tends to peak at 24 h, 48 h, and 72 h.

In contrast, we found that the return probability is characterized by several peaks at 24 h, 48 h and 72 h, capturing a strong tendency of humans to return to locations they visited before, describing the recurrence and temporal periodicity inherent to human mobility.

Ranking locations by the number of times an individual was recorded in their vicinity reveals a Zipf-like distribution.

To explore if individuals return to the same location over and over, we ranked each location on the basis of the number of times an individual was recorded in its vicinity.

The probability of finding a person at the location of rank L is well predicted by 1/L, regardless of how many locations the user visits (the exponent is the same for users who visit 5, 10, 30, or 50 places).

We find that the probability of finding a user at a location with a given rank L is well approximated by $P(L) \sim 1/L$, independent of the number of locations visited by the user (Fig. 2d).
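
A quick sketch (my own) of how the $P(L) \sim 1/L$ visitation law can be checked from such records: rank each user's locations by visit count and average the share of visits at each rank across users. The toy `visits` dict is purely illustrative.

from collections import Counter, defaultdict

# `visits` maps a user id to the list of location (tower) ids in that user's records.
visits = {
    "u1": ["home", "home", "work", "home", "gym", "work", "home"],
    "u2": ["home", "work", "home", "home", "cafe"],
}

rank_share = defaultdict(list)
for user, locations in visits.items():
    counts = Counter(locations).most_common()   # locations sorted by visit count
    total = sum(c for _, c in counts)
    for rank, (_, c) in enumerate(counts, start=1):
        rank_share[rank].append(c / total)

# P(L): average share of time spent at the L-th most visited location.
for rank in sorted(rank_share):
    print(rank, sum(rank_share[rank]) / len(rank_share[rank]))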

Preferential Return

People spend most of their time in only a few places.

Therefore, people devote most of their time to a few locations, although spending their remaining time in 5 to 50 places, visited with diminished regularity.

Therefore, the observed logarithmic saturation of $r_g(t)$ is rooted in the high degree of regularity in the daily travel patterns of individuals, captured by the high return probabilities (Fig. 2b) to a few highly frequented locations (Fig. 2d).

Each user can be assigned to a well defined area, defined by home and workplace, where she or he can be found most of the time.

Our results suggest that:

  • The Levy statistics observed in the bank-note measurements capture a convolution of the population heterogeneity shown in equation (2) and the motion of individual users.
  • Individuals display significant regularity, because they return to a few frequently visited locations such as home or work.
  • This regularity does not apply to the bank notes: a bill always follows the trajectory of its current owner; that is, dollar bills diffuse, but humans do not.

Taken together, our results suggest that the Levy statistics observed in bank note measurements capture a convolution of the population heterogeneity shown in equation (2) and the motion of individual users. Individuals display significant regularity, because they return to a few highly frequented locations, such as home or work. This regularity does not apply to the bank notes: a bill always follows the trajectory of its current owner; that is, dollar bills diffuse, but humans do not.

The fact that individual trajectories are characterized by the same $r_g$-independent two-dimensional probability distribution suggests that key statistical characteristics of individual trajectories are largely indistinguishable after rescaling. Therefore, our results establish the basic ingredients of realistic agent-based models, requiring us to place users in numbers proportional to the population density of a given region and to assign each user an $r_g$ taken from the observed $P(r_g)$ distribution. Using the predicted anisotropic rescaling, combined with the density function (the shape of which is provided as Supplementary Table 1), we can obtain the likelihood of finding a user in any location. Given the known correlations between spatial proximity and social links, our results could help quantify the role of space in network development and evolution and improve our understanding of diffusion processes.