A Little Review of Big Data Books

I recently finished three books on “big data”: Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier; Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, by Seth Stephens-Davidowitz; and Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, by Thomas H. Davenport.

None of these books was a whiz-bang thriller, but I enjoyed them.

Big Data was a very sensible introduction. What exactly is “big data”? It’s not just bigger data sets (though it is also that). It’s the opportunity to get all the data.

Until now, the authors point out, we have lived in a data-poor world. We have had to design our surveys carefully to avoid sampling bias, because we just can’t sample that many people. Statistics devotes a whole branch of math to calculating how certain we can be about a particular result, or whether it could just be random chance biasing our samples. I could poll 10,000 people about their jobs, and that might be a pretty good sample, but if everyone I polled happened to live within walking distance of my house, would that be a very representative sample of everyone in the country? Now think about all of those studies on the mechanics of sleep, done on whatever college students or homeless guys a scientist could convince to sleep in a lab for a week. How representative are they?
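The authors’ point about convenience samples is easy to demonstrate with a toy simulation. Everything below is invented for illustration (the incomes, the “well-off neighborhood” effect); the point is just that 10,000 people drawn from one neighborhood can be far more misleading than 10,000 people drawn at random.

```python
import random

random.seed(42)

# Made-up population: income depends partly on neighborhood.
# Neighborhood 0 is unusually well-off; neighborhoods 1-9 are typical.
population = [
    (hood, 50_000 + (20_000 if hood == 0 else 0) + random.gauss(0, 10_000))
    for hood in range(10)
    for _ in range(10_000)
]

true_mean = sum(income for _, income in population) / len(population)

# A "convenience sample": 10,000 people, all from neighborhood 0.
local = [income for hood, income in population if hood == 0]
local_mean = sum(local) / len(local)

# A random sample of the same size.
random_sample = [income for _, income in random.sample(population, 10_000)]
random_mean = sum(random_sample) / len(random_sample)

print(f"true mean:          {true_mean:,.0f}")
print(f"neighborhood-only:  {local_mean:,.0f}")   # well above the true mean
print(f"random sample:      {random_mean:,.0f}")  # close to the true mean
```

No amount of extra polling fixes the first sample; adding more neighbors just makes you more confidently wrong.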

Today, though, we suddenly live in a data-rich world. An exponentially data-rich world. A world in which we no longer need to correct for bias in our sample, because we don’t have to sample. We can just get… all the data. You can go to Google and find out how many people searched for “rabbit” on Tuesday, or how many people misspelled “rabbit” in various ways.

Data is being used in new and interesting (and sometimes creepy) ways. Many things that previously weren’t even considered data are now being quantified, like one researcher quantifying people’s backsides to determine whether a car is being driven by its owner or by a stranger.

One application I find promising is using people’s searches for disease symptoms to identify those who may be ill before they seek out a doctor. Catching cancer patients earlier could save millions of lives.

I don’t have the book in front of me anymore, so I am just going by memory, but it made a good companion to Auerswald’s The Code Economy, since the modern economy runs so much on data.

Everybody Lies takes a much more lighthearted, anecdotal approach to the subject, discussing lots of different studies. Stephens-Davidowitz was inspired by Freakonomics, and he wants to use Big Data to uncover hidden truths of human behavior.

The book discusses, for example, people’s pornographic searches (as per the title, people routinely lie about how much porn they look at on the internet) and whether people’s pornographic preferences can be used to estimate what percent of people in each state are gay. It turns out that porn queries can be broken down by state and variety, allowing a rough estimate of the gay and straight population of each state, and what people are willing to tell pollsters about their sexuality doesn’t match what they search for online. In more conservative states, people are less likely to admit to pollsters that they are gay, but plenty of supposedly “straight” people are searching for gay porn: about the same proportion as actually admit to being gay in more liberal states.

Stephens-Davidowitz uses similar data to argue that people have been lying to pollsters (or perhaps to themselves) about whom they plan to vote for. For example, Donald Trump got anomalously high vote totals in some areas, and Obama got anomalously low ones, compared to what people in those areas told pollsters. Both effects correlated strongly with the areas of the country where people made the most racist Google searches.

Most of the studies discussed are amusing, like the discovery of the racehorse American Pharoah. Others are quite important, like a study which found that child abuse was probably rising at a time when official reports said it wasn’t; the reports were likely undercounting abuse because funding for investigating it had been cut.

At times the author steps beyond the studies and offers interpretations that I think go further than the data warrants, like his conclusion that parents are biased against their daughters because they worry more about daughters being fat than about sons, or because they are more likely to Google “is my son a genius?” than “is my daughter a genius?”

I can think of a variety of alternative explanations. For example, society itself is crueler to overweight women than to overweight men, so it is reasonable, in turn, for parents to worry more about a daughter who will face that cruelty than about a son who will not. Girls are more likely to be in gifted programs than boys, but perhaps that just means giftedness is less exceptional in girls than in boys. Or perhaps male giftedness differs from female giftedness in some way that makes parents need more information on the topic.

Now, here’s an interesting study. Google can track how many people make Islamophobic searches at any particular time. Set against the timing of Obama’s speech meant to calm outrage after the San Bernardino attack, this data reveals that the speech was massively unsuccessful: Islamophobic searches doubled during and after the speech, negative searches about Syrian refugees rose 60%, and searches asking how to help dropped 35%.

In fact, just about every negative search we could think to test regarding Muslims shot up during and after Obama’s speech, and just about every positive search we could think to test declined. …

Instead of calming the angry mob, as everybody thought he was doing, the internet data tells us that Obama actually inflamed it.

However, Obama later gave another speech, on the same topic. This one was much more successful. As the author put it, this time, Obama spent little time insisting on the value of tolerance, which seems to have just made people less tolerant. Instead, “he focused overwhelmingly on provoking people’s curiosity and changing their perceptions of Muslim Americans.”

People tend to react positively toward people or things they regard as interesting, and invoking curiosity is a good way to get people interested.

The author points out that “big data” is most likely to be useful in fields where the current data is poor. In the case of American Pharoah, for example, people just plain weren’t getting a lot of data on racehorses before buying and selling them. It was a field based on people who “knew” horses and their pedigrees, not on people who x-rayed horses to see how big their hearts and lungs were. By contrast, hedge funds investing in the stock market are already up to their necks in data, trying to maximize every last penny. Horse racing was ripe for someone to become successful by unearthing previously unused data and making good predictions; the stock market is not.

And for those keeping track of how many people make it to the end of the book, I did. I even read the endnotes, because I do that.

Big Data at Work was very different. Rather than entertain us with the success of Google Flu or academic studies of human nature, BDAW discusses how to implement “big data” (the author admits it is a silly term) strategies at work. This is a good book if you own, run, or manage a business that could use data in some way. UPS, for example, uses driving data to shorten package delivery routes; even a small saving per package adds up to large savings for the company as a whole, since it delivers so many packages.

The author points out that “big data” often isn’t so much big as unstructured. Photographs, call logs, Facebook posts, and Google searches may all be “data,” but you need some way to quantify them before you can make much use of them. For example, a company might gather customer feedback reports, feed them into a program that recognizes positive or negative language, and then tally how many people called to report that they liked Product X versus how many called to report that they disliked it.
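A minimal sketch of that kind of feedback quantification, using simple keyword matching in place of the language-recognition program the author describes (the word lists and feedback strings here are invented for illustration; a real system would use a trained sentiment model):

```python
import re

# Toy sentiment tally: classify each feedback report as positive or
# negative by keyword matching, then count the results. The keyword
# lists are invented for illustration.
POSITIVE = {"love", "great", "works", "happy", "excellent"}
NEGATIVE = {"broken", "hate", "terrible", "refund", "disappointed"}

def classify(report: str) -> str:
    words = set(re.findall(r"[a-z]+", report.lower()))
    pos, neg = len(words & POSITIVE), len(words & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

def tally(reports):
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for r in reports:
        counts[classify(r)] += 1
    return counts

feedback = [
    "I love Product X, it works great",
    "Product X arrived broken, I want a refund",
    "It is a product",
]
print(tally(feedback))  # {'positive': 1, 'negative': 1, 'neutral': 1}
```

The unstructured text goes in; a structured count, which a manager can actually chart and act on, comes out. That conversion is the whole game.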

I think an area ripe for this kind of quantification is medical data, which currently languishes in doctors’ files, much of it on paper, protected by patient privacy laws. But people post a good deal of information about their medical conditions online, seeking help from others who’ve dealt with the same diseases. Currently, there are a lot of conditions (take depression) where treatment is very hit-or-miss, and doctors basically have to try a series of drugs until they find one that works. A program that could trawl through forum posts and assemble data on which treatments worked or failed for which patients could help doctors refine treatment for difficult conditions: “Oh, you look like the kind of patient who would respond well to melatonin,” or “Oh, you have the characteristics that make you a good candidate for Prozac.”

The author points out that most companies will not be able to keep the massive quantities of data they are amassing. A hospital, for example, collects a great deal of data about patients’ heart rates and blood oxygen levels every day. While it might be interesting to look back at 10 years’ worth of patient heart rate data, hospitals can’t really afford to invest in databanks to store all of that information. Rather, what companies need is real-time or continuous data processing that analyzes current data and makes predictions or recommendations for what the company (or doctor) should do now.

For example, one of the books (I believe it was Big Data) discussed a study of premature babies which found, counterintuitively, that they were most likely to have emergencies soon after a lull in which they had seemed to be doing rather well: stable heart rate, good breathing, etc. Knowing this, a hospital could have a computer monitor all of its premature babies, automatically update their status (“stable,” “improving,” “critical,” “likely to have a big problem in six hours”), and notify doctors of potential problems.
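The rolling, real-time style of monitoring described above can be sketched like this. The window size, threshold, and readings are arbitrary illustration values, not clinical guidance; the point is that old data falls off the end automatically instead of being stored forever:

```python
from collections import deque

class HeartRateMonitor:
    """Keep only a rolling window of recent readings and flag any
    reading that drifts far from the window's average. The window
    size and threshold are made-up illustration values."""

    def __init__(self, window: int = 5, threshold: float = 25.0):
        # deque with maxlen discards the oldest reading on overflow,
        # so storage stays constant no matter how long the stream runs
        self.readings = deque(maxlen=window)
        self.threshold = threshold

    def add(self, bpm: float) -> str:
        if self.readings:
            avg = sum(self.readings) / len(self.readings)
            status = "alert" if abs(bpm - avg) > self.threshold else "stable"
        else:
            status = "stable"  # nothing to compare against yet
        self.readings.append(bpm)
        return status

monitor = HeartRateMonitor()
stream = [120, 122, 118, 121, 119, 160]  # sudden jump at the end
statuses = [monitor.add(bpm) for bpm in stream]
print(statuses)  # ['stable', 'stable', 'stable', 'stable', 'stable', 'alert']
```

A real bedside system would be far more sophisticated, but the architecture is the same: analyze the current window, emit a status, discard the rest.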

The book goes into a fair amount of detail about how to implement “big data solutions” at your office (you may have to hire someone who knows how to code, and may even have to tolerate their idiosyncrasies), which platforms are useful for data, the fact that “big data” is not all that different from the standard analytics most companies already run, etc. Once you’ve got the data pumping, actual humans may not need to be involved with it very often; for example, you may have a system that automatically updates drivers’ routes based on traffic reports, or sprinklers that automatically turn on when the ground gets too dry.

It is easy to see how “big data” will become yet another facet of the algorithmization of work.

Overall, Big Data at Work is a good book, especially if you run a company, but not as amusing if you are just a lay reader. If you want something fun, read the first two.