Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, By Seth Stephens-Davidowitz

RATING:

1 star

0 stars = good but not on the scale

1 star = perspective supplementing

2 stars = perspective influencing

3 stars = perspective altering

SHORT SUMMARY (272 words or less)

This book focuses on big data analysis: how large data sets enable a more honest view of human trends.  The author theorizes four virtues of big data:  (1) size; (2) honesty; (3) unique segmentation opportunities; and (4) untapped data sources.  As a main example, the author’s theory is that people type things into Google they otherwise would not reveal in public, and the database size allows us to learn more about revealing tendencies in population segments.

I think this is an important book.  The underlying theories are interesting, and big data analysis already impacts most aspects of our lives.  For example, A/B testing keeps you on that app or website just a little longer than you otherwise would stay.  It’s important to understand these assumptions.  

The book seems to follow two threads.  One is the specific trends observed from big data analysis; you can read about them below (the most interesting to me was the doppelgänger analysis).  The second involves the virtues of big data analysis itself.  It’s here where I have some comments, not doubts, on the assumptions.  Can the analysis become self-fulfilling?  After reading this, I see big data sets as a land rush that’s soon ending.  As data companies consolidate into larger data sets, is there room for other sets to come into existence?  Or are we starting from a limited number of sources to determine trends, theories, etc., and then perpetuating those prophecies?  In other words, if Google is a starting point, is there room for another Google-like data source, years down the line, to update the resulting conclusions?  Do we stop buying strawberry pop tarts?

LONG SUMMARY

Started on January 5, 2017:

-Over the Christmas holiday, some good friends recommended this book, so I put it next in my Audible queue.

-About 2 hours in, I have some general thoughts.  Initially, the author talks about the motivation for this book.  It’s derived from his PhD work, which involved using Google data to infer trends.  He goes into a discussion about fallacies related to the law of small numbers (referencing Daniel Kahneman’s similar criticism [I think] of the topic in his “Thinking, Fast and Slow” book).  I’m not really sure of the details here, but from what I gather, this book is going to be a defense of big data.

-Much of this book will be filled with correlation trends sourced from big data repositories like Google, etc.  So throughout this summary, there will probably be a lot of comments like “during X times, people tend to search for Y,” “people from X area tend to search for Y,” or “people who exhibit X behavior tend to have Y characteristics.”

-I think an initial disclaimer should be made though.  As I’m listening to this book, I kind of get the feeling that it should really be two different books.  The author seems to be chasing two different, but not mutually exclusive, themes.  The first is Gladwell-ian, showing how certain trends are related to certain determining factors (e.g., the NBA example below).  The second is a defense of big data itself.  Both are interesting, but it feels a little clunky how he bounces between those themes.

-The central theme of the book is that people tell Google (and other services) information that they wouldn’t admit in person (avoiding what’s known as “social desirability bias”), and thus the data sets provided by these services reveal richer information about societies on a large scale.

-The first chapter goes into an interesting discussion about NBA players.  A conventional myth is that NBA players, particularly African American players, are born and raised in tough inner-city environments.  The author runs this assumption through large and/or extensive data sets and debunks it.  Most NBA players tend to come from middle-income or affluent zip codes or counties, with two-parent, middle-class upbringings.  The point being that the story of LeBron is the exception, not the rule.

-The author also goes into a discussion of data analysis taken from pornographic website databases.  He finds some, um, interesting trends.

-The trends he discovers regarding racist and hateful speech are just as disturbing, particularly his description of an uptick in racist joke searches after Obama’s election and on Martin Luther King Jr. Day.  If anything, the author is able to see into the hidden (or maybe not so hidden) underbelly of our society through this data analysis.

-The author extols the virtues of big data with four attributes:  (1) big data is, well, big, so it allows us to run large-scale experiments that would otherwise be very time-, labor- and cost-intensive, and lets us glean correlations and causations from those experiments; (2) big data is honest (the point above: people tell Google things they would never admit in public); (3) big data allows us to segment data in unique ways due to its size; and (4) big data provides new types of data that would otherwise be unavailable (e.g., the author says to think about the pornographic database that Freud and other psychologists would have loved to access).

-All of this makes sense, except I have one piece of criticism.  Big data sources may be honest–in that the data is not false–but they may not necessarily be complete.  For example, I wonder how susceptible to skewing these derived trends are.  Are you more likely to type into Google “I hate my job” than you are “I love my job”?

-The author mentions how big data analysis may not always be about understanding why a trend is the way it is, but rather that a trend exists at all. An interesting example of this from the book is horse racing. The traditional analysis for identifying potentially successful racehorses involved looking at a horse’s genetic lineage. The author mentions that American Pharoah, which went on to win the Triple Crown, was identified via data analysis. Analysis of many previous racehorses indicated that left ventricle size was strongly correlated with a racehorse’s success. American Pharoah at age 1 was in the middle percentiles for most physical attributes, except for left ventricle size, for which it was in the 99th.
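The left-ventricle finding is essentially a correlation screen: measure lots of attributes across past horses, and rank them by how strongly each tracks success. A minimal sketch of the idea; the attributes and numbers here are entirely my own invention for illustration, not from the book:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Made-up past-racehorse measurements (one value per horse, per attribute).
attributes = {
    "height":         [1, 3, 2, 5, 4],
    "weight":         [2, 1, 4, 3, 5],
    "left_ventricle": [1, 2, 3, 4, 5],
}
win_rate = [0.10, 0.18, 0.33, 0.41, 0.55]  # career success per horse

# Rank attributes by strength of correlation with success.
ranked = sorted(attributes,
                key=lambda a: abs(pearson(attributes[a], win_rate)),
                reverse=True)
print(ranked[0])  # "left_ventricle" wins in this toy data
```

Note that this finds a predictive signal without explaining it, which is exactly the author's point: the trend is useful even if the "why" stays unknown.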

-Another interesting example: Walmart determined, based on an evaluation of all their sales data, that people tend to stock up on strawberry pop tarts before a hurricane. Not just any pop tarts, but specifically strawberry pop tarts. So now, before hurricanes hit, they send high quantities of strawberry pop tarts down I-95 toward Florida, and they sell fast. Walmart does not necessarily know why this happens, just that it does.

Which takes me to another thought. Is this a hidden danger in big data analysis? Taking the trivial strawberry pop tart example, is there a danger of further reinforcing a trend once we have identified that it exists? Walmart initially learns that these pop tarts sell out, based on presumably years of data. But then, because they know this, they double down and reinforce the data by targeting the supply accordingly. On one hand, of course you’d want to make money once you identify a market demand. But the larger question to me: if you pick a point in time to do an analysis, look at the data backwards from that point to identify trends, and then move at large scale to address those trends, have you in essence frozen out the ability for the trend to change? It seems possible that big data can amplify and self-perpetuate; Walmart knows that people eat strawberry pop tarts before a hurricane, stocks their shelves, and then people buy more strawberry pop tarts because there are more available.

-The United States “is” vs. the United States “are”. The book goes into a historical question of when in US history citizens started referring to the United States as a singular country (saying, for example, the United States “is”) vs. a confederation of states (the United States “are”). Conventional historical theory says this verb shift happened right after the Civil War. The author points out that Google Ngram (which enables word-frequency analysis across digitized texts) indicates that the plural verb was still dominant 15 years after the Civil War. So maybe the shift in public perception of the US as one country rather than a federation happened much later.

By the way, I just played around with Google Ngrams, pretty cool feature.
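The “is” vs. “are” comparison is, at bottom, just phrase counting over a dated corpus. A toy sketch of the idea; the mini-corpus below is invented for illustration, whereas the real Ngram Viewer works over millions of scanned books:

```python
# Hypothetical corpus of dated sentences, grouped by year.
corpus = {
    1850: ["the united states are a young federation", "the states are many"],
    1880: ["the united states are growing", "the united states is expanding"],
    1910: ["the united states is a world power"],
}

def usage_ratio(sentences):
    """Fraction of 'the united states <verb>' mentions using the singular 'is'."""
    is_n = sum(s.count("the united states is") for s in sentences)
    are_n = sum(s.count("the united states are") for s in sentences)
    total = is_n + are_n
    return is_n / total if total else None

# A rising ratio over the years would show the singular form taking over.
for year, sentences in sorted(corpus.items()):
    print(year, usage_ratio(sentences))
```

Plot that ratio by year and you have, in miniature, the chart the author draws from Ngram data.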

-This book is filled with cool little facts. One analysis of Facebook data showed differences between men’s and women’s language. The stereotypical ones: men are more likely to talk about football and Xbox, women more likely to talk about shopping and hair. The data also indicates that men curse more than women. Another interesting analysis covers age breakdown, which the author calls “Drink-Work-Pray”: younger people post about partying, the middle-aged about work, and older people about prayer/spirituality.

-Sentiment analysis: determining the mood of a particular text.

-Sentiment analysis described in the book indicates that a large percentage of stories fit into one of six structures:

1. Rags to riches – rise

2. Riches to rags – fall

3. Man in a hole – fall then rise

4. Icarus – rise then fall

5. Cinderella – rise then fall then rise

6. Oedipus – fall then rise then fall
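Those six shapes come from tracking sentiment across the length of a story and labeling the rises and falls. A crude sketch of the idea, using a tiny hand-made word list (real sentiment analysis uses far larger lexicons and smoothing):

```python
# Toy sentiment lexicon; real analyses score thousands of words.
POSITIVE = {"happy", "rich", "love", "win", "joy"}
NEGATIVE = {"sad", "poor", "hate", "lose", "grief"}

def sentiment_arc(text, windows=3):
    """Split a story into equal chunks and score each: (#pos - #neg) words."""
    words = text.lower().split()
    size = max(1, len(words) // windows)
    chunks = [words[i:i + size] for i in range(0, len(words), size)][:windows]
    return [sum(w in POSITIVE for w in c) - sum(w in NEGATIVE for w in c)
            for c in chunks]

def shape(arc):
    """Label each segment transition as a rise or a fall."""
    return ["rise" if b > a else "fall" for a, b in zip(arc, arc[1:])]

story = "poor sad grief then work work work then rich happy joy win"
print(shape(sentiment_arc(story)))  # prints ['rise', 'rise']: a rags-to-riches arc
```

Bucket many stories by their rise/fall sequence and the six structures above fall out as the most common patterns.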

-The author makes an interesting point about how social networks actually expose you to more diverse viewpoints. This runs counter to the traditional notion of social network echo chambers. The author’s argument involves survey data that projects your political compatibility relative to people in your life (for example, the likelihood that you and a coworker will have differing political views). The data indicates that your real-life connections are less likely to disagree with you politically than your online connections (or a random online pairing) are. The theory is that your online social connections are weak social links, and thus you are less likely to self-select them out, because you wouldn’t really hang out with them in person anyway. An interesting take for sure; I’ll have to think about that.

-Good point about how truth serum data like Google’s can help the most vulnerable populations that don’t admit their problems publicly or underreport them: for example, child abuse victims, women seeking off-the-grid abortions, etc.

-While Google may be a truth serum data source, Facebook and other social media may be the opposite. On social media, people have an incentive not to tell the whole truth. I think the same could be said about the truth serum data sources too, though. An interesting idea: digital truths vs. digital lies.

-One analysis indicated that political views tend to be formed between the ages of 14 and 24. People’s political views don’t grow more conservative as time goes on; rather, they are shaped by their perceptions during those formative years.  People who were 14-24 during the popular Eisenhower presidency remained more conservative throughout their lives, while those who were that age during the Kennedy and early LBJ years leaned the other way.  People who were 14-24 during unpopular presidencies like Nixon’s shifted away from that president’s party later in life too.

-Sports allegiances form for boys between the ages of 9 and 19; for girls (women), around age 22.

-The author does a county-by-county analysis of “famous” baby boomers (those baby boomers who have Wikipedia entries not earned through bad acts). His goal is to project out exactly which areas are more conducive to success, as defined by appearing in Wikipedia. His findings: cities produce more famous/successful people than suburbs do, as do college towns and areas with super-specialization (e.g., due to a specific industry within a city or county). It’s an interesting analysis, though I do question the assumptions. In general, this is my fundamental view of the book: the analysis reveals very interesting trends, but simultaneously, it’s incredibly reliant on the data sources, or combinations of data sources, feeding the base assumptions. Not saying I disagree; food for thought though. What if there was analysis on what the ideal data source(s) should look like, and do those already exist?

-More people search broad philosophical questions on Google between 2am and 4am, which is interesting to me as it relates to the Hindu concept of “brahmamuhurtha”.

-Doppelgänger research – The author provides a discussion of doppelgänger research in baseball player analysis.  The story involves David Ortiz: how his performance dipped in the 2009 season, and how, according to conventional wisdom, he should have been let go or traded.  But Nate Silver (I believe) did a doppelgänger analysis on David Ortiz by taking a record of every major league player, with as many stats as available for each, and then finding the player whose record most closely mirrored Big Papi’s through the ’09 season.  The results indicated that he had not yet hit his peak, and when the Red Sox kept him (not sure if they kept him because of this analysis or not; I kind of zoned out), it proved to be true.  Ortiz ended up becoming a multi-year All-Star after his ’09 season and was named 2013 World Series MVP after the Red Sox won.

The author’s larger point is that doppelgänger research can be incredibly valuable.  Amazon uses it for book recommendations: they find someone who is not exactly like you, but almost exactly like you, and provide you with book recommendations based on that person.  Netflix does the same thing with movie recommendations.  The author mentions that it’s astonishing how underutilized this data analysis technique is.
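The doppelgänger technique is essentially a nearest-neighbor search over stat vectors. A minimal sketch with made-up stat lines; a real analysis would use many more features and normalize them so no single stat dominates the distance:

```python
import math

# Hypothetical per-season stat lines: (batting_avg, home_runs, walks).
players = {
    "Player A": (0.238, 28, 74),
    "Player B": (0.312, 12, 40),
    "Player C": (0.241, 29, 71),
}

def nearest_doppelganger(target, candidates):
    """Return the candidate whose stat vector is closest (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(candidates, key=lambda name: dist(candidates[name], target))

slump_season = (0.240, 29, 72)  # a made-up slumping-season stat line
print(nearest_doppelganger(slump_season, players))  # "Player C"
```

Once you have the closest historical twin, you look at how that twin's career went afterward; that's the projection that suggested Ortiz still had good years left.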

-A/B testing.  The author discusses the power of A/B testing and its prevalence across internet communities, websites, networks and products.  A/B testing basically involves selecting two population groups and showing one group an “A” version of a web feature and the other group a “B” version (different font colors, text, etc.).  The power of A/B testing is that hypotheses about what keeps consumers’ attention can be tested and iterated on very quickly relative to the non-internet world.  The author makes a good point about the potential dangers: does this ability inevitably lead internet sites to unlock the attention draws of the human mind and create addictive behavior?  By honing in over and over on the UX that draws in users, an argument could be made that it does.
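Deciding whether variant B actually beat variant A usually comes down to a two-proportion significance test on the click counts. A minimal sketch with hypothetical numbers:

```python
import math

def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-statistic comparing the click-through rates of variants A and B."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)  # combined rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 2000 visitors per variant, B's headline draws more clicks.
z = two_proportion_z(clicks_a=120, n_a=2000, clicks_b=168, n_b=2000)
print(round(z, 2))  # |z| > 1.96 suggests the difference is unlikely to be chance
```

The speed the author describes comes from being able to rerun this loop daily: ship two versions, collect counts, keep the winner, repeat.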
