From Subject To Soundbite, Results To Hype: Communicating Data Science

Last week I was forwarded an article from Science News describing a new data analysis as “one of the largest text and data mining exercises ever conducted.” As top scientific journals like Science join the growing trend of emphasizing the size of data analyses, what might we learn about how data science is increasingly being communicated to the public?

Has the way in which we talk about innovation and technical discovery changed over the past half century? In 2010, Paul Magelli and I explored this question through the lens of the digitized ProQuest New York Times archive, searching all 18 million articles the paper had published from 1945 to 2005 for any mention of an American research university. While major research universities represent only one dimension of the technical innovation landscape and only a portion of their discourse revolves around their research, they have historically been a key source of the basic research breakthroughs that receive press coverage. By assessing how they have been portrayed in the media, we gain at least a partial understanding of the public information environment around research institutions and what an average member of the public will see about them on a daily basis.

In the case of the New York Times, we found that while the Times had shrunk linearly by half over those 60 years, its coverage of research universities had remained constant in terms of raw article count. Thus, as a percentage of all Times articles published each year, mentions of universities had steadily increased, with at least one research institution mentioned in 13% of all articles and 21% of all front-page Times articles per year by 2005.

However, even as universities have become a mainstay of news coverage, the nature of how they are mentioned has changed dramatically. In 1946, 53% of articles mentioning a research university did so in the lead paragraph, suggesting the discussion was about that institution and its activities and research, while in 2005, just 15% of the articles mentioning a university were primarily about that institution. The remaining articles typically only cited prominent faculty or research as soundbite commentary on major events.

In short, over half a century, the nation's premier research institutions had gone from newsmakers to news commentators. Rather than being the subject of the news themselves, universities and their faculty had become merely quick on-demand soundbites about other people's news.

Why is this important? If researchers today want their work covered as the subject of the news, rather than a soundbite about other events, they face increasing competition for attention.

Drawing from this, as data science and data mining have become mainstream, there has been a marked focus on emphasizing the size and scale of such projects, rather than their caveats or limitations. Imagine two projects that yield the same results, but one analyzes a few megabytes of data and one analyzes 100 petabytes. The latter is far more likely to garner headlines across the mainstream press and attract attention within the research community, including future funding. While there is certainly merit to arguing that the dramatically larger sample size could increase confidence in the finding, depending on the experimental design, there is a natural temptation in today’s data science community to equate the size of a data analysis with its importance.

Press releases about new data science and technical discoveries therefore all too often spend their precious word allocation focusing on the size of the dataset examined, rather than on what was actually done or on carefully outlining the caveats. Increasingly, even the published papers themselves exclude critical detail that would permit external validation or caveating of the results. I frequently come across Twitter analyses published in high-impact journals whose “data and methods” sections read along the lines of “We analyzed a collection of Twitter data” without any indication of just how that collection was assembled. Was it from the 1% streaming API, the Decahose, the Firehose or some custom collection mechanism? More and more data analytic papers use hand-wavy descriptions of what they are analyzing and how it was assembled, making replication, or indeed any kind of verification, impossible.
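
To illustrate, here is a minimal sketch, in Python and with entirely invented field names and values, of the kind of provenance record a methods section could publish alongside a Twitter analysis. Nothing below is drawn from any actual study; it simply shows how little space it takes to say where the data came from.

```python
# A minimal sketch of the provenance detail a "data and methods" section could
# capture for a hypothetical Twitter study. All field names and values here are
# invented illustrations, not taken from any specific paper.

from dataclasses import dataclass, asdict
import json


@dataclass
class CollectionProvenance:
    source: str            # e.g. "1% streaming API", "Decahose", "Firehose", "custom crawler"
    sampling_rate: str     # fraction of the full stream the source delivers
    query: str             # exact keyword/track query used, verbatim
    start_date: str        # ISO dates bounding the collection window
    end_date: str
    language_filter: str   # any language restriction applied at collection time
    tweets_collected: int  # raw count before any downstream filtering


record = CollectionProvenance(
    source="1% streaming API",
    sampling_rate="~1% of all public tweets",
    query="track=syria OR damascus OR aleppo",
    start_date="2015-01-01",
    end_date="2015-12-31",
    language_filter="none",
    tweets_collected=48_213_907,  # hypothetical count
)

# Publishing a record like this verbatim lets readers judge representativeness
# and attempt replication.
print(json.dumps(asdict(record), indent=2))
```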

Making matters worse, there is a growing trend of population inflation, in which papers report the original dataset size before filtering down to the actual content used. For example, a Twitter analysis might report that the full corpus of tweets during the period of interest totaled more than 100 billion posts and that keyword querying was used to filter to a smaller subset of interest that was actually subjected to the analysis, without ever reporting how large that final subset was.
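
A toy sketch makes the alternative clear: report both numbers. The tweets and keywords below are invented, but the point is that the raw corpus size and the size of the subset actually analyzed both appear in the output.

```python
# A sketch of reporting both the raw corpus size and the size of the subset
# actually analyzed, rather than quoting only the headline-friendly raw number.
# The tweets and keywords below are invented for illustration.

keywords = {"syria", "damascus", "aleppo"}

raw_corpus = [
    "Breaking news from Damascus tonight",
    "My cat refuses to get off the keyboard",
    "Aleppo residents describe the shelling",
    "Great pizza in Brooklyn",
    "Reports of fighting near Homs",
]


def matches(tweet_text: str) -> bool:
    """Toy keyword filter standing in for whatever query a study applies."""
    return bool(keywords & set(tweet_text.lower().split()))


analyzed_subset = [t for t in raw_corpus if matches(t)]

# Both numbers belong in the methods section: the raw corpus size that makes
# for an impressive headline, and the far smaller subset actually analyzed.
print(f"Raw corpus: {len(raw_corpus):,} tweets")
print(f"Analyzed after keyword filtering: {len(analyzed_subset):,} tweets")
```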

Take a paper whose methods section consists entirely of the statement “A corpus of more than 100 billion tweets was scanned for all posts sent by Syrians about the ongoing civil war.” This statement alone gives none of the information needed to verify how representative the results might be. Was this the complete Firehose corpus or just keyword searching of the public API? Were there any systematic biases in the material making up this particular subset of tweets? Most importantly, what specifically were the search criteria used to filter for domestic discussion about the civil war? In at least one study I encountered, the authors simply keyword queried for a set of English names of major cities in Syria without considering that those tweets might have been sent by people outside the country.
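
A hypothetical example shows how easily such a filter misleads: tweets that mention Syrian cities can come from anywhere in the world, and even a crude (and itself imperfect) check of self-reported location changes the picture substantially. Everything below is invented for illustration.

```python
# A sketch of why keyword filtering on city names alone is not a filter for
# "domestic discussion": tweets mentioning Syrian cities can come from anywhere.
# The tweets, profile locations, and city list are invented for illustration.

CITY_NAMES = {"damascus", "aleppo", "homs"}

tweets = [
    {"text": "Protest in Aleppo today", "profile_location": "Aleppo, Syria"},
    {"text": "Reading about Damascus in the news", "profile_location": "London, UK"},
    {"text": "Homs under shelling again", "profile_location": ""},
]


def mentions_city(tweet):
    return any(city in tweet["text"].lower() for city in CITY_NAMES)


keyword_only = [t for t in tweets if mentions_city(t)]

# Even a crude self-reported-location check (itself noisy and incomplete)
# changes the answer substantially.
keyword_and_location = [t for t in keyword_only
                        if "syria" in t["profile_location"].lower()]

print(len(keyword_only), "tweets mention a Syrian city")
print(len(keyword_and_location), "of those self-report a location in Syria")
```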

As the focus becomes less about methods and more about the size of the analysis, this raises the question of how precisely one actually defines the “size” of a data mining study. Raw article count rewards analyses of large numbers of small messages (such as a Twitter analysis), while raw word count at least captures the computational demand of applying many kinds of text mining algorithms to the material. Information density, algorithmic complexity and computational requirements all make for other, equally reasonable, measures of size. Further muddying the waters, computing requirements can vary dramatically based on the kind of analysis being run. Neural networks can require exponentially greater computational resources than their corresponding classical counterparts, while simulations can easily consume hundreds of thousands or even millions of processors with relatively small input datasets.
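
A quick sketch, using made-up corpora, shows how differently those metrics can rank the same material: a collection of a million tweets dwarfs a shelf of books by article count, yet falls well behind it by word count and byte count.

```python
# A sketch of how differently a corpus "size" can read depending on the metric.
# The documents are invented; the point is that article count, word count and
# byte count rank the same material very differently.

corpus_tweets = ["short post"] * 1_000_000   # many tiny messages
corpus_books = ["word " * 100_000] * 50      # a few very long documents


def describe(name, docs):
    article_count = len(docs)
    word_count = sum(len(d.split()) for d in docs)
    byte_count = sum(len(d.encode("utf-8")) for d in docs)
    print(f"{name}: {article_count:,} documents, "
          f"{word_count:,} words, {byte_count:,} bytes")


describe("Tweet-like corpus", corpus_tweets)  # wins on article count
describe("Book-like corpus", corpus_books)    # wins on word and byte count
```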

Moreover, no single metric alone can fully capture the complexity of a data analysis. Take Google’s 2006 n-gram dataset, which processed more than a trillion words of text. Few text mining projects come close to that word count, yet building n-gram tables is far less computationally demanding than using neural networks to calculate dependency graphs or performing advanced text mining on 234 billion words of books. Similarly, instead of absolute counts, one could assess scale in terms of the percentage of available data in the given domain being analyzed. If looking at the combined Western socio-cultural academic literature on Africa and the Middle East over the last 70 years, one may approach a large fraction of that output with just 21 billion words of material, rather than Google’s one-trillion-word collection. Additionally, the results of large data analyses may simply be integrated into algorithmic updates or released over time through public presentations and blog posts, rather than formal papers published in the academic literature, making it even more difficult to assess the scale or novelty of a new study.

Returning to the Science News article, in correspondence with the authors themselves, it appears that rather than “one of the largest text and data mining exercises ever conducted,” a more apt description of their work might have been “one of the largest comparisons published in the academic literature comparing full text informational content to abstracts in the domain of primarily biomedical academic literature.” While such a caveated statement hardly rolls off the tongue, it far more accurately captures the boundaries of the datasets and methods utilized in the study and its ultimate impact on its field, rather than making the blanket argument that it was the largest data mining effort ever performed. In this case, Science's news editor emphasized that the journal works hard to properly situate its coverage of data analyses, but declined to comment further on how it approaches the issue of communicating caveats.

Why is this important? In a word: replication. The sciences are increasingly grappling with a replication crisis in which even some of the most influential studies yield very different results when other researchers attempt to recreate their findings. This in turn undermines public confidence in the output of the research community. The experimental sciences have in their favor a long history of precise descriptions of experimental setups, from the equipment and configurations collecting the data to the filtering and processing mechanisms preparing it for analysis. The data sciences, on the other hand, operate far more like the Wild West, where vague descriptions of data collection and filtering and bespoke exotic analytic environments are far from uncommon, making replication all but impossible.

In the case of the Science News study, the authors themselves were clear in their paper as to the specific datasets they used and their acquisition mechanism, but even the most rigorously detailed of papers undergo a game of telephone as they are covered in the mainstream and social media and discussed by those outside the field. Should data scientists push back when they see their work described in hyperbolic terms? Should they emphasize the caveats and context of their work when speaking with journalists or engaging with the public, knowing full well that such additional details might discourage coverage of their work?

Putting this all together: as the data sciences have become ever more mainstream, and as the cacophony of research clamoring for attention competes with a media shift towards researchers as soundbites rather than subjects, there has been a growing trend to focus on the size of data analyses rather than present them in carefully nuanced context, and to dispense with the kind of intricate detail on data collection and preparation that makes future replication and validation possible. Do data scientists owe a professional duty to counterbalance the natural tendency of those outside the field to overhype “big data” analyses or strip the caveats from their findings? There are no easy answers, but at the end of the day, if data science wants to be taken seriously as a field, it must adopt the technical trappings of the rest of the experimental sciences, which place a premium on detailed substance over style, lest it become just another soundbite.