In a Data-Driven World, Honesty is the Fundamental Virtue

In times of war, there are no greater virtues than loyalty and bravery. A country with disloyal citizens is likely to lose every battle. Bravery is an essential trait to overcome the harshness of war.

Likewise, for most of human history, sexual purity was promoted as an essential virtue. Because contraceptives were not an option, a promiscuous society would be one with frequent unwanted pregnancies. As a result, many societies developed strong cultural norms to discourage physical relationships before marriage.

Today, the world is relatively free of wars and effective contraceptives are widely available, so these traits are valued less. Our culture constantly re-evaluates its norms.

Meanwhile, as more and more of our lives are recorded, the data we collect facilitate better decision-making. I’ve written about many aspects of that: data help journalists find better stories, help predict the future, and much more.

However, a data-driven society is only functional when people follow the right cultural norms. In a country at war, a culture of loyalty helps ensure that everyone is in line. In a capitalist society, a culture that discourages theft can allow small businesses to prosper without fear of losing property. In a data-driven society, we must stay intellectually honest.

Without intellectual honesty, the data are flawed and unreliable. Flawed data lead to poor decision-making; it’s usually better to use only your gut than to rely on a poorly formed data set. And unfortunately, many people are using data in intellectually dishonest ways.

Schools and Cheating

In a data-driven world, we must not cheat.

One of this year’s Goldsmith Award finalists is an astonishing data-driven series which uncovered high levels of school cheating. I’m proud that we honored that story, but I’ve been somewhat taken aback by some of the responses I’ve heard to it.

Many people I’ve spoken with, when told of this investigation, immediately blamed the reward structure. No Child Left Behind, they say, created an overly pressurized education system. The sort of large-scale cheating exposed by the Atlanta Journal-Constitution was inevitable given the high stakes of the tests.

That is nonsense. One wouldn’t excuse a CEO stealing money from others because there was so much pressure on him to improve his company’s performance — even if the CEO thought the means of evaluating him were unfair. No Child Left Behind and other education policies aren’t perfect (my suggestions), but they’re a starting point.

To improve, we’ll need to refine our testing system and get better at measuring progress. We’ll also need a culture of integrity from teachers and administrators. Without that integrity, our system will consist of results we can’t trust — and a terrible example for students.

Guns and Intellectual Curiosity

In a data-driven world, we must approach major issues with an open mind and intellectual curiosity.

Of course people use data dishonestly for political arguments. But it’s not just sleazy politicians: I see intelligent friends on both sides completely misrepresenting the data on gun violence in the U.S. Anti-gun advocates point out that the U.S. has more gun ownership and more gun deaths than other Western countries and jump to the “obvious” conclusion that more guns means more violence. Pro-gun advocates point out that the U.S. has far higher gun ownership rates than many (non-Western) countries that are much more violent than the U.S.; they jump to the equally “obvious” conclusion that criminals will find a way to purchase guns regardless of gun policy, meaning decreases in gun ownership would have no impact on gun violence.

It’s likely that each side is at least partially correct. Millions more guns in Americans’ hands mean at least a few more deaths; many of the most violent criminals will find a way to kill regardless of gun laws.

Yet my friends who post these stats do so with a lack of intellectual curiosity. In most cases, they haven’t looked at the numbers with an open mind, and they don’t really understand how gun dynamics work. I’ve never seen, for instance, someone point out that gun ownership in states is highly correlated with suicide rates but minimally correlated with murder rates. That fact — which implies that less restrictive gun laws may lead to suicides but not terrible crimes — doesn’t fit neatly into anyone’s pro- or anti-gun view of the world.

I’m realistic: I don’t expect that everyone is going to gather data sets on gun violence on their own. However, because the data are out there, I ask smart people to raise their bar: if you haven’t looked at the data closely enough to have an informed, nuanced opinion, please keep quiet. Either do some real research, or don’t spread your uninformed perspective.

Pitching Investors and Misleading

In a data-driven world, we must not mislead or be misled.

Working with many early stage companies, I see a lot of investor pitches.

Startups have gotten better at crafting an appealing pitch, by throwing out numbers like these:

  • a) We’ve increased revenue by 25%, month over month
  • b) Our user base is growing: we had 500,000 users last July and now have 800,000 (often with an attached graph showing total users at the end of each month)
  • c) Our monthly retention rate is 75%

Most of the time, these stats aim to mislead.

(a) doesn’t have a baseline or a time horizon. It might mean that revenue was $12 last month and is $15 this month. Does that sound as impressive?
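
To make the missing baseline concrete, here is a minimal sketch in Python. The helper name `pct_growth` and the larger revenue pair are illustrative assumptions, not figures from any real pitch.

```python
def pct_growth(previous: float, current: float) -> float:
    """Percentage change from previous to current."""
    if previous == 0:
        raise ValueError("growth is undefined for a zero baseline")
    return (current - previous) / previous * 100.0

# Two hypothetical companies with identical month-over-month growth.
tiny = (12.0, 15.0)                  # the $12 -> $15 case from the text
large = (1_200_000.0, 1_500_000.0)   # made-up larger baseline for contrast

for label, (prev, cur) in (("tiny", tiny), ("large", large)):
    growth = pct_growth(prev, cur)
    print(f"{label}: {growth:.0f}% growth, +${cur - prev:,.0f} in absolute terms")
```

Both lines report the same percentage; only the absolute change tells the two companies apart.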

(b) looks at cumulative signups rather than monthly signups. Cumulative signups are always going up, and it takes a lot more effort for the viewer to see whether the second derivative (signups this month versus last month) was positive or negative. We did this in investor presentations for Circle of Moms, because we knew it would obfuscate some of our negative trends. I advocated that approach, and I don’t feel great about it.
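
The effect can be sketched with made-up numbers. The monthly signup figures below are hypothetical, chosen only so that the cumulative total rises from 500,000 to 800,000 as in example (b); they are not Circle of Moms data.

```python
from itertools import accumulate

# Hypothetical monthly signups (not real data): each month is weaker than the last.
monthly_signups = [60_000, 50_000, 45_000, 40_000, 35_000, 30_000, 25_000, 15_000]
starting_users = 500_000

# Total users at the end of each month: the line a pitch deck usually shows.
cumulative = list(accumulate(monthly_signups, initial=starting_users))[1:]

# Month-over-month change in signups: the "second derivative" of the cumulative curve.
signup_change = [b - a for a, b in zip(monthly_signups, monthly_signups[1:])]

print("cumulative users: ", cumulative)
print("monthly signups:  ", monthly_signups)
print("change in signups:", signup_change)
```

The cumulative series climbs every month, while every entry in the change-in-signups series is negative. A chart of the first hides what the second makes obvious.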

(c) may be a useful stat, but it’s almost always calculated in an obfuscated, company-friendly way.

These kinds of pitches are “just the way it is,” as is the case for misleading political data analysis. But in a data-driven world, we need to aim higher.

Entrepreneurs should assume their audience is intelligent and mature, cognizant that not all numbers go up.

And investors should understand what they’re looking at, and should call BS on entrepreneurs who surface their numbers like this.

Conclusion

Access to data is, on the whole, a very good thing. Deep, data-driven knowledge allows us to make better decisions and preserve resources. With the right data, we can better reward the best teachers, fund the top companies, and create better public policy.

However, for that to work, our society needs to create a stronger culture of honesty around data. We can’t cheat to get around failures. We must seek out all of the facts, and not promote only those that fit into a narrow ideology. And we must use data to inform rather than mislead. If we don’t do those things, we’ll make decisions that are driven by flawed data — lies — and many will suffer.

The good news is that we’re still early in an age of data-driven decision-making. Our collective culture has developed to better discourage practices like stealing and killing. In this wonderful age of data and better decision-making, can we become more honest?

Journalism, Through the Eyes of a Data-Focused Entrepreneur

Fueled by wine and delicious food, the table was full of energy. The journalists and politicians at my table were eager to outdo one another. They exchanged candid personal stories about famous TV newsmen and potential presidential candidates. They recounted tales of the shocking political corruption they’d uncovered. They told their colleagues what had really gone on at that recent big event. Unable to compete with their stories, I nodded politely.

Somehow, I’d found myself at the upscale Rialto restaurant in Harvard Square, feeling like a fly who’d landed on the wrong wall. I was twenty-four, a quiet and unassuming fraud R&D scientist at a small Silicon Valley startup called PayPal. The eight or nine people at my table were quite unlike me: all big names from journalism and public policy, all far more extroverted than I, all at least twice my age.

It was January 2002, and I was sure that two worlds couldn’t be any more different. Here in the journalism sphere were politicians and journalists, extroverted and boisterous, who told great stories; back at PayPal, analytical, introverted nerds whose skills were mostly technical. At the dinner table in Cambridge were those who could talk to sources and get the scoop; back in Silicon Valley, engineers who automated processes and crunched data.

Within about a decade, all of that would completely flip. News organizations would embrace the data-heavy, analytical approach more common to tech companies. Many of 2013’s top stories would use data at a level that was unfathomable in 2002.

The Goldsmith Awards

My first journey to Cambridge was set in motion only a few days before that dinner, when I got an urgent call from my grandfather.

Can you be in Boston on Saturday to help select the winners of the Goldsmith Awards?

I had only a vague sense of what the Goldsmith Awards were; why exactly should I fly across the country on a few days’ notice?

The Goldsmith Awards, he explained to me, were something he’d worked with the Shorenstein Center at Harvard to set up, using assets from the estate of the late Berda Goldsmith (his legal client). The awards honored great journalism, but their true goal was to foster better public policy. He wanted to reward journalism that shines a light on government, highlighting bad regulations and bad policymakers for the benefit of ordinary citizens.

Being a lawyer who was both thoughtful and crafty, my grandfather Bob put in place a contract that would maintain close ties between the foundation and the Shorenstein Center. A key clause in the contract stated that one of the award’s judges must be a foundation representative.

Bob had called me because he wanted me to take over as the foundation representative. He hoped that I could go to Cambridge on Friday to see how everything worked. Then the following year, I could represent the Greenfield Foundation on the Goldsmith selection panel.

Of course, I said. I’d long been interested in public policy — I minored in Political Science at Stanford — and this would be a real honor.

2002: Reporting

At the judging session — before the dinner at Rialto — I got my first jolt of culture shock. The Goldsmith panel of judges evaluated dozens of newspaper submissions, and their criteria were often a surprise. I’d long been a consumer of the news; that day I got to see a news professional’s perspective for the first time.

Some stories were very impressive on the surface: engaging, in-depth, surprising reports on a policy topic I knew nothing about. It turned out, however, that they closely resembled another story told earlier by someone else. The first story a journalist told about toxins in the local drinking water was probably very impressive; the twelfth such story — reported using the same template as the first one — doesn’t deserve an award.

More stories were dinged for reasons I wouldn’t have fathomed as a mere news consumer. Some were largely the product of a single leak: they came from an insider who wanted his story told, rather than from sleuthing by the reporter. Others were impressive investigative feats, but pointed to flaws in public policies which had virtually no chance of being changed.

The best pieces that year were original, impressive in the depth of their investigations, and had substantial impact on policy. The winner, about hospital care at the Hutch in Seattle, stood out for the sheer amount of manual work it required: the reporters had to wade through “100 interviews and 10,000 pages of documents” to tell their story. The story was amazing, and was notable to me for the set of skills used by the reporters: their methods were a far cry from the algorithm coding I was doing at PayPal.

2013: Reporting and Data

Serving on the Goldsmith panel soon became a tradition for me. Last week, for the twelfth time, I found my heavy winter jacket in the back of the closet (it’s useless in the Bay Area) and packed up for a January weekend in Boston. I’m now the veteran; this year I served with several people who had never judged a competition like the Goldsmith Awards. I have yet to judge with a panelist younger than me, but I no longer elicit the “what’s that little kid doing at the table?” stares I saw a decade ago.

But that change is predictable: I knew I wouldn’t stay twenty-four forever.

The big surprise is that investigative journalism, so different from my PayPal day job in 2002, now feels like a natural project for a Silicon Valley startup data guy like me.

Journalism has changed a lot in the past decade. In 2002, almost all investigative stories were anecdotal. A story about ineffective education started and ended with interviews of teachers, parents and students. The investigation about medical treatment told stories of the travesties patients had endured, without using terms like “probability” or “percentage”, let alone “false positive”. Occasionally, a Goldsmith submission would talk about the painstaking work that reporters had done to piece together hundreds of paper records to assemble some basic statistics.

Today, by contrast, data analysis plays a huge role in many of the top stories. Of this year’s six finalists, at least three would have been unlikely or impossible ten years ago:

  • Cheating our Children, from the Atlanta Journal-Constitution, is a story about cheating by teachers and schools on standardized tests. The team looked at thousands of districts across the country for highly suspicious anomalies, like every student in a class (supposedly) erasing an incorrect answer to question #27 and then filling in the correct answer. They found several hundred patterns of student improvements that were most likely the result of fraud.
  • State Integrity Investigation, from the Center for Public Integrity, looks at the laws of each of the fifty states and grades each state on its risk of corruption. To do this, a reporter in each state perused that state’s practices and regulations (a far more manual approach than Cheating our Children) and assembled a database of information about that state. The end result is both a great way of pressuring states (Utah, don’t you want to improve your D?) and an incredible Wikipedia-like online resource for others (especially journalists) interested in tackling related topics in the future.
  • The Shame of the Boy Scouts, from the LA Times, is the sad story of thousands of incidents of child sexual abuse in the Boy Scouts. The Times pulled together thousands of newly released Boy Scout records, using them to tell many unbelievable and sad stories about children who were molested. But the series complemented those stories with a feature that would have been unlikely a decade ago: they posted all of the documents online, for anyone to search and see.

These new data-centered stories are distinguished by three new attributes. The first relates to how a story was uncovered: many stories today are initially found not from a tip, but from a database search. In Cheating our Children, cheating was uncovered not because of a tip from a parent or a teacher, but because of a search for suspicious trends in the data. The steps to get the story were (at a high level) similar to what I was doing at PayPal in 2002: using algorithms to identify a handful of likely fraudsters.
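
As a rough illustration of this kind of search, the Python sketch below flags classrooms whose wrong-to-right erasure counts sit far above the rest. It is not the Journal-Constitution’s actual method, which is not detailed here; the data, the `flag_outliers` helper, and the 3.5 threshold are all assumptions made for illustration.

```python
from statistics import median

def flag_outliers(values: dict, threshold: float = 3.5) -> list:
    """Return keys whose modified z-score (median/MAD based) is above threshold."""
    med = median(values.values())
    # Median absolute deviation: a spread measure that one extreme value barely moves.
    mad = median(abs(v - med) for v in values.values())
    if mad == 0:
        return []  # no spread to measure against, so nothing can be flagged
    # One-sided check: only unusually HIGH values are treated as suspicious.
    return [k for k, v in values.items() if 0.6745 * (v - med) / mad > threshold]

# Hypothetical average wrong-to-right erasures per student, by classroom.
erasures = {
    "class_A": 1.1, "class_B": 0.9, "class_C": 1.4, "class_D": 1.0,
    "class_E": 1.2, "class_F": 0.8, "class_G": 9.5, "class_H": 1.3,
}
print(flag_outliers(erasures))
```

With these made-up numbers only `class_G` is flagged. A flag like this is a lead for a reporter to follow up on, not proof of cheating.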

The second data attribute is quantifiability. Historically, journalism has not been a quantitative field, relying instead on an anecdote or two, along with an assumption that “there are many others like them”. And while quantification would be silly for many stories — either Nixon’s people broke into Watergate or they didn’t — it’s an important part of any broader societal story. In 2011, the Goldsmith winner informed the public of local hospital practices that were quantifiably worse than others out there. This year, there were some great not-quite-finalist stories that found and measured the effect of cops speeding and explained just how harmful overly prevalent pain medications can be.

Finally, many of the top stories today are complemented by a structured, searchable database. Each of the three stories above features an interactive tool allowing anyone to find the information most useful to them. On ajc.com, I can look at my local school district for evidence of cheating; on publicintegrity.org I can see how my state fares with respect to corruption risk factors; on latimes.com I can see whether there were any reports of sexual abuse at a specific Boy Scout troop.

Though the world of journalism has its challenges, these are three great developments. They widen the range of stories journalists can tell, they raise the bar on their quality, and they make them individually relevant to the reader.

The Great Bifurcation

The landscape of news and other content is bifurcating, with increasing separation between work that aims to be high-traffic and work that aims to be high-impact. On one side is entertaining content, aimed at driving page views. That content may be news, opinion, or something else, but its goal is very simple: to be part of a traffic machine that underlies an ad-supported online operation.

Traffic machine content is most successful if it arouses curiosity (yes, I do want to check out the six ways that olive oil can help me lose weight!) and can be even more so if it’s also something the reader identifies with and wants to share (this is why people who voted for my political candidate are smart!). That high-traffic story, while cheap to produce, is usually not especially deep or insightful, and it may not even be true. Thus it has little or no positive impact on our institutions.

The other side of the coin, high-impact journalism, is a very different animal. It takes a lot of work and money to produce, but often doesn’t generate a lot of traffic. It may have a great impact on society, but it’s tough to justify for a business. And that’s why, increasingly, it’s the domain of non-profit entities and organizations that are only nominally for profit.

Journalism via a non-profit can be a good thing: those organizations, which today operate at both local and national scale, can focus on the highest-impact work rather than try to mix unpopular high-impact stories with popular low-impact ones.

2023: Reporting, Data, and Software

In the new non-profit news organization, there is a simple question to ask: “how can we do work that will have the largest positive impact on public policy?” That is essentially the same question my grandfather asked when he set up the Goldsmith Awards over two decades ago.

To understand how that question will be answered a decade from now, one must first understand the roles played by three different people in today’s professional world:

  • The investigative reporter skillfully combs through documents and asks the right people the right questions to find information. He then turns that information into a compelling story for his audience.
  • The data scientist takes the data available to her and mines it to quantifiably understand a subject. With words, numbers, and data visualization, she shares — usually with less verbal skill than the journalist — that understanding with others.
  • The software developer takes a process that works manually, and figures out how to first generalize it and then automate it. For instance, if you have a meeting in your calendar with an accompanying address, how can software automatically send you directions at the appropriate time? A human can do it manually; the developer writes the software that will automate the process many times over.

A decade ago, the stories I read for the Goldsmith Awards were solely the work of reporters from the first group. They were executed by skilled journalists who knew how to comb through documents, convince insiders to give them secret information, and write stories elegantly.

Today, the data scientist is a key part of journalism: data skills are nearly as important for producing Goldsmith-caliber work as classic investigative skills. Data skills help both at the early phases of a story in finding anomalies worth writing about, and in moving beyond anecdotes to show that trends can be quantified. That anomaly-finding helps increase the range of stories that can be told; quantification makes the stories better.

Still, today’s journalism has a one-off quality that would frustrate a typical software developer. Sure, I can read a story about cheating in schools — or even look at how it affects my hometown — but will the story be automatically updated in three years so it’s still relevant?

In the next decade, it’s likely that we’ll see investigative reporting evolve and improve in several ways:

  • More and more journalism will be automated and updated regularly. District scores will be mined every week; state corruption will be automatically assessed monthly. In some cases, there will be written stories that complement the new data; in other cases the automated jobs will simply feed into an interactive database available to readers.
  • Investigative reporters will get better at soliciting information from their readers and viewers. It’s become a lot easier for readers to contact reporters with tips than it was a few decades ago, but there’s still a lot of room for improvement. Facebook, LinkedIn, Quora, and Twitter make it easier to find and contact the person likely to know a specific piece of information, but they’re not ideal. One could, for instance, imagine a world where citizens could record any suspicious or unacceptable government actions in a form that could be searched by reporters in the future; this would markedly improve many investigative stories.
  • The number of journalists with data skills is increasing rapidly, and that isn’t going to change any time soon: my Twitter feed is filled with data+government+journalism enthusiasts from many different backgrounds. They’re offering online courses, pushing for open data, and a lot more.
  • More and more data — particularly from governments — will come online. The picture today is awful: most government documents are still posted in unstructured form as PDFs and Word docs, making data analysis a lot tougher. That will change.

These changes will allow journalists to more quickly find important stories and tell them more accurately. At a time when some news organizations are slashing budgets and others are defining themselves, that’s important.

Merging Worlds

When I went to Harvard eleven years ago, I couldn’t help feeling like I didn’t quite belong. It was an honor to be part of the Goldsmith Awards, but I was there because I happened to be the grandson of the awards’ founder.

This year, I flew to Boston a day early and spent time with Alex Jones, the longtime Director of the Shorenstein Center. As always, I learned a few things. Alex told me about Journalist’s Resource, a great online tool which lets journalists freely access research on complex topics. He highlighted the increasing role of data in journalism and among many of the top Goldsmith Awards contenders.

While there, I also chatted with Nicco Mele and John Wihbey, both staff members at Shorenstein. Nicco lectures on technology and simultaneously runs a web consultancy; John is the developer behind Journalist’s Resource. Both were full of ideas on how data, journalism, and technology can come together to improve public policy, telling me about cool projects like Journalist’s Resource and Nearby FYI. It was inspiring, and in my conversations I saw a new take on my grandfather’s vision for journalistic impact.

That Saturday night, after a full day selecting the Goldsmith finalists, seven of us met for dinner at Rialto. I was still the introverted techie, and I still didn’t come armed with personal stories about Clintons or Bushes. But having just discussed such a strong set of data-heavy stories, I knew something was different. The landscape has shifted, and journalists have caught on to many of the skills my friends and I value in Silicon Valley. Once just a fly on the wall, the data geek is now an important part of the story.

Thanks to Ben Greenfield for his great feedback on this post.