A Little Review of Big Data Books

I recently finished three books on “big data”–Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier; Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, by Seth Stephens-Davidowitz; and Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, by Thomas H. Davenport.

None of these books was a whiz-bang thriller, but I enjoyed them.

Big Data was a very sensible introduction. What exactly is “big data”? It’s not just bigger data sets (though it is also that). It’s the opportunity to get all the data.

Until now, the authors point out, we have lived in a data-poor world. We have had to carefully design our surveys to avoid sampling bias because we just can’t sample that many people. There’s a whole bunch of math done over in statistics to calculate how certain we can be about a particular result, or whether it could just be the result of random chance biasing our samples. I could poll 10,000 people about their jobs, and that might be a pretty good sample, but if everyone I polled happens to live within walking distance of my house, is this a very representative sample of everyone in the country? Now think about all of those studies on the mechanics of sleep done on whatever college students or homeless guys a scientist could convince to sleep in a lab for a week. How representative are they?
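To make the walking-distance problem concrete, here is a small Python sketch. Everything in it is invented for illustration: the two-region "country," the neighborhood, and the income figures are hypothetical numbers of mine, not data from the book. It compares a random sample of 10,000 people with a sample of the 10,000 people who happen to live near me.

```python
import random
import statistics

def make_population(seed=0):
    """Build a hypothetical country: 90,000 people elsewhere, 10,000 in my neighborhood."""
    rng = random.Random(seed)
    population = []
    for _ in range(90_000):
        # Assumed incomes for the rest of the country (made-up distribution).
        population.append(("elsewhere", rng.gauss(50_000, 10_000)))
    for _ in range(10_000):
        # Assumed incomes near my house: noticeably higher (also made up).
        population.append(("my_neighborhood", rng.gauss(80_000, 10_000)))
    return population

def mean_income(people):
    """Average income over a list of (region, income) pairs."""
    return statistics.fmean(income for _, income in people)

population = make_population()
rng = random.Random(1)

# A proper random sample of 10,000 people from the whole country.
random_sample = rng.sample(population, 10_000)
# The "walking distance" sample: 10,000 people, all from one neighborhood.
nearby_sample = [p for p in population if p[0] == "my_neighborhood"]

true_mean = mean_income(population)
random_mean = mean_income(random_sample)
nearby_mean = mean_income(nearby_sample)

print(f"whole country: {true_mean:,.0f}")
print(f"random sample: {random_mean:,.0f}")
print(f"nearby sample: {nearby_mean:,.0f}")
```

Both samples are the same size, but only the random one lands near the true average. A bigger sample does not fix a biased one, which is why having all the data is such a different situation.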

Today, though, we suddenly live in a data-rich world. An exponentially data-rich world. A world in which we no longer need to correct for bias in our sample, because we don’t have to sample. We can just get… all the data. You can go to Google and find out how many people searched for “rabbit” on Tuesday, or how many people misspelled “rabbit” in various ways.

Data is being used in new and interesting (and sometimes creepy) ways. Many things that previously weren’t even considered data are now being quantified–like one researcher quantifying people’s backsides to determine whether a car is being driven by its owner or a stranger.

One application I find promising is using people’s searches for various disease symptoms to identify people who may have various diseases before they seek out a doctor. Catching cancer patients earlier could save millions of lives.

I don’t have the book in front of me anymore, so I am just going by memory, but it made a good companion to Auerswald’s The Code Economy, since the modern economy runs so much on data.

Everybody Lies was a much more lighthearted, anecdotal approach to the subject, discussing lots of different studies. Stephens-Davidowitz was inspired by Freakonomics, and he wants to use Big Data to uncover hidden truths of human behavior.

The book discusses, for example, people’s pornographic searches (as per the title, people routinely lie about how much porn they look at on the internet) and whether people’s pornographic preferences can be used to determine what percent of people in each state are gay. It turns out that we can get a breakdown of porn queries by state and variety, allowing a rough estimate of the gay and straight population of each state–and it appears that what people are willing to tell pollsters about their sexuality doesn’t match what they search for online. In more conservative states, people are less likely to admit to pollsters that they are gay, but plenty of supposedly “straight” people are searching for gay porn–about the same number of people as actually admit to being gay in more liberal states.

Stephens-Davidowitz uses similar data to determine that people have been lying to pollsters (or perhaps themselves) about whom they plan to vote for. For example, Donald Trump got anomalously high votes in some areas, and Obama got anomalously low votes, compared to what people in those areas told pollsters. However, both of these areas correlated highly with areas of the country where people made a lot of racist Google searches.

Most of the studies discussed are amusing, like the discovery of the racehorse American Pharoah. Others are quite important, like a study that found that child abuse was probably actually going up at a time when official reports said it wasn’t–the reports probably weren’t showing abuse due to a decrease in funding for investigating abuse.

At times the author steps beyond the studies and offers interpretations of why the results are the way they are that I think go beyond what the data tells, like his conclusion that parents are biased against their daughters because they are more concerned with girls being fat than with boys, or because they are more likely to Google “is my son a genius?” than “is my daughter a genius?”

I can think of a variety of alternative explanations. For example, society itself is crueler to overweight women than to overweight men, so it is reasonable, in turn, for parents to worry more about a daughter who will face cruelty than a son who will not. Girls are more likely to be in gifted programs than boys, but perhaps this means that giftedness in girls is simply less exceptional than giftedness in boys, who are more unusual. Or perhaps male giftedness is different from female giftedness in some way that makes parents need more information on the topic.

Now, here’s an interesting study. Google can track how many people make Islamophobic searches at any particular time. Compared against Obama’s speech that tried to calm outrage after the San Bernardino attack, this data reveals that the speech was massively unsuccessful. Islamophobic searches doubled during and after the speech. Negative searches about Syrian refugees rose 60%, while searches asking how to help dropped 35%.

In fact, just about every negative search we could think to test regarding Muslims shot up during and after Obama’s speech, and just about every positive search we could think to test declined. …

Instead of calming the angry mob, as everybody thought he was doing, the internet data tells us that Obama actually inflamed it.

However, Obama later gave another speech, on the same topic. This one was much more successful. As the author put it, this time, Obama spent little time insisting on the value of tolerance, which seems to have just made people less tolerant. Instead, “he focused overwhelmingly on provoking people’s curiosity and changing their perceptions of Muslim Americans.”

People tend to react positively toward people or things they regard as interesting, and invoking curiosity is a good way to get people interested.

The author points out that “big data” is most likely to be useful in fields where the current data is poor. In the case of American Pharoah, for example, people just plain weren’t getting a lot of data on racehorses before buying and selling them. It was a field based on people who “knew” horses and their pedigrees, not on people who x-rayed horses to see how big their hearts and lungs were. By contrast, hedge funds investing in the stock market are already up to their necks in data, trying to maximize every last penny. Horse racing was ripe for someone to become successful by unearthing previously unused data and making good predictions; the stock market is not.

And for those keeping track of how many people make it to the end of the book, I did. I even read the endnotes, because I do that.

Big Data at Work was very different. Rather than entertain us with the success of Google Flu Trends or academic studies of human nature, BDAW discusses how to implement “big data” (the author admits it is a silly term) strategies at work. This is a good book if you own, run, or manage a business that could utilize data in some way. UPS, for example, uses driving data to shorten its delivery routes; even a small saving per package leads to a large saving for the company as a whole, since it delivers so many packages.

The author points out that “big data” often isn’t big so much as unstructured. Photographs, call logs, Facebook posts, and Google searches may all be “data,” but you will need some way to quantify these before you can make much use of them. For example, companies may want to gather customer feedback reports, feed them into a program that recognizes positive or negative language, and then count how many people called to report that they liked Product X vs how many called to report that they disliked it.
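As a rough idea of what that feedback program might look like, here is a toy Python sketch. It is my own illustration, not anything from the book: the word lists and the sample reports are hypothetical, and a real system would presumably use a trained sentiment model rather than keyword matching.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for this example.
POSITIVE = {"love", "like", "great", "good", "excellent"}
NEGATIVE = {"hate", "dislike", "broken", "bad", "terrible"}

def classify(report):
    """Label one free-text report as 'liked', 'disliked' or 'unclear'."""
    words = re.findall(r"[a-z']+", report.lower())
    # Score = number of positive words minus number of negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "liked"
    if score < 0:
        return "disliked"
    return "unclear"

def tally(reports):
    """Turn a pile of unstructured reports into counts per label."""
    return Counter(classify(r) for r in reports)

# Made-up feedback reports about Product X.
reports = [
    "I love Product X, it works great",
    "Product X arrived broken and support was terrible",
    "Called to ask about shipping times",
    "Good value, I like it",
]
counts = tally(reports)
print(counts)
```

Keyword counting like this is easily fooled ("I don't like it" would count as positive), but it shows the basic move: unstructured text goes in, numbers you can chart come out.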

I think an area ripe for this kind of quantification is medical data, which currently languishes in doctors’ files, much of it on paper, protected by patient privacy laws. But people post a good deal of information about their medical conditions online, seeking help from other people who’ve dealt with the same diseases. Currently, there are a lot of diseases (take depression) where treatment is very hit-or-miss, and doctors basically have to try a bunch of drugs in a row until they find one that works. A program that could trawl through forum posts and assemble data on patients and medical treatments that worked or failed could help doctors refine treatment for various difficult conditions–“Oh, you look like the kind of patient who would respond well to melatonin,” or “Oh, you have the characteristics that make you a good candidate for Prozac.”

The author points out that most companies will not be able to keep the massive quantities of data they are amassing. A hospital, for example, collects a great deal of data about patients’ heart rates and blood oxygen levels every day. While it might be interesting to look back at 10 years’ worth of patient heart rate data, hospitals can’t really afford to invest in databanks to store all of this information. Rather, what companies need is real-time or continuous data processing that analyzes current data and makes predictions/recommendations for what the company (or doctor) should do now.
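One way to picture processing data without storing it is a running summary that gets updated with each new reading and then throws the reading away. The Python sketch below is my own illustration, not a method from the book: it uses Welford's method for a running mean and standard deviation, the heart-rate numbers are made up, and the three-standard-deviation threshold is an arbitrary assumption with no clinical meaning.

```python
import math

class RunningStats:
    """Count, mean and spread of a stream, kept in constant memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new reading into the summary; the reading itself is not stored."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stdev(self):
        """Sample standard deviation of everything seen so far."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

def is_unusual(stats, value, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the running mean."""
    if stats.count < 2 or stats.stdev == 0:
        return False
    return abs(value - stats.mean) > threshold * stats.stdev

stats = RunningStats()
alerts = []
# Made-up heart-rate readings (beats per minute) arriving one at a time.
for bpm in [72, 75, 71, 74, 73, 72, 140, 74]:
    if is_unusual(stats, bpm):   # check against what we have seen so far
        alerts.append(bpm)
    stats.update(bpm)            # then update the summary and move on

print(alerts)
```

The summary never grows no matter how many readings arrive, which is the point. In this sketch a flagged reading still gets folded into the running average afterward; whether that is what you want is a design decision.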

For example, one of the books (I believe it was “Big Data”) discussed a study of premature babies which found, counter-intuitively, that they were most likely to have emergencies soon after a lull in which they had seemed to be doing rather well–stable heart rate, good breathing, etc. Knowing this, a hospital could have a computer monitoring all of its premature babies and automatically updating their status (“stable” “improving” “critical” “likely to have a big problem in six hours”) and notifying doctors of potential problems.

The book goes into a fair amount of detail about how to implement “big data solutions” at your office (you may have to hire someone who knows how to code and may even have to tolerate their idiosyncrasies), which platforms are useful for data, the fact that “big data” is not all that different from standard analytics that most companies already run, etc. Once you’ve got the data pumping, actual humans may not need to be involved with it very often–for example you may have a system that automatically updates drivers’ routes with traffic reports, or sprinklers that automatically turn on when the ground gets too dry.

It is easy to see how “big data” will become yet another facet of the algorithmization of work.

Overall, Big Data at Work is a good book, especially if you run a company, but not as amusing if you are just a lay reader. If you want something fun, read the first two.

Logan Paul and the Algorithms of Outrage

Leaving aside the issues of “Did Logan Paul actually do anything wrong?” and “Is changing YouTube’s policies actually in Game Theorist’s interests?” Game Theorist makes a good point: while YouTube might want to say, for PR reasons, that it is doing something about big, bad, controversial videos like Logan Paul’s, it also makes money off those same videos. YouTube–like many other parts of the internet–is primarily click driven. (Few of us are paying money for programs on YouTube Red.) YouTube wants views, and controversy drives views.

That doesn’t mean YouTube wants just any content–a reputation for having a bunch of pornography would probably have a damaging effect on channels aimed at small children, as their parents would click elsewhere. But aside from the actual corpse, Logan’s video wasn’t the sort of thing that would drive away small viewers–they’d get bored of the boring non-cartoons talking to the camera long before the suicide even came up.

Logan Paul actually managed to hit a very sweet spot: controversial enough to draw in visitors (tons of them) but not so controversial that he’d drive away other visitors.

In case you’ve forgotten the controversy in a fog of other controversies, LP’s video about accidentally finding a suicide in the Suicide Forest was initially well-received, racking up thousands of likes and views before someone got offended and started up the outrage machine. Once the outrage machine got going, public sentiment turned on a dime and LP was suddenly the subject of a full two or three days of Twitter hate. The hate, of course, got YouTube more views. LP took down the video and posted an apology–which generated more attention. Major media outlets were now covering the story. Even Tablet managed to quickly come up with an article: Want a New Years Resolution? Don’t be Like Logan Paul.

And it worked. I passed up Tablet’s regular article on Trump and Bagels and Culture, but I clicked on that article about Logan Paul because I wanted to know what on earth Tablet had to say about LP, a YouTuber whom, 24 hours prior, I had never heard of.

And the more that respectable (or at least highly trafficked) news outlets picked up the story, the higher Logan’s videos rose on the YouTube charts. And as more people watched more of LP’s other videos, they found more things to be offended at. For example, once he ran through the streets of Japan holding a fish. A FISH, I tell you. He waved this fish at people and was generally very annoying.

I don’t like LP’s style of humor, but I’m not getting worked up over a guy waving a fish around.

So understand this: you are in an outrage machine. The purpose of the outrage machine is to drive traffic, which makes clicks, which result in ad revenue. There are probably whole websites (Huffpo, CNN) that derive a significant percent of their profits from hate-clicks–that is, intentionally posting incendiary garbage not because they believe it or think it is just or true or appeals to their base, but because they can get people to click on it in sheer shock or outrage.

Your emotions–your “emotional labor” as the SJWs call it–are being turned into someone else’s dollars.

And the result is a country that is increasingly polarized. Increasingly outraged. Increasingly exhausted.

Step back for a moment. Take a deep breath. Get some fresh air. Ask yourself, “Does this really matter? Am I actually helping anyone? Will I remember this in a week?”

I’d blame the SJWs for the outrage machine–and really, they are good at running it–but I think it started with CNN and “24 hour news.” You have to do something to fill that time. Then came Fox News, which was like CNN, but more controversial in order to lure viewers away from the more established channel. Now we have the interplay of Facebook, Twitter, HuffPo, online newspapers, YouTube, etc–driven largely by automated algorithms designed to maximize clicks–even hate clicks.

The Logan Paul controversy is just one example out of thousands, but let’s take a moment and think about whether it really mattered. Some guy whose job description is “makes videos of his life and posts them on YouTube” was already shooting a video about his camping trip when he happened upon a dead body. He filmed the body, called the police, canceled his camping trip, downed a few cups of sake while talking about how shaken he was, and ended the video with a plea that people seek help and not commit suicide.

In between these events was laughter–I interpret it as nervous laughter in an obviously distressed person. Other people interpret this as mocking. Even if you think LP was mocking the deceased, I think you should be more concerned that Japan has a “Suicide Forest” in the first place.

Let’s look at a similar case: When three-year-old Alan Kurdi drowned, the photograph of his dead body appeared on websites and newspapers around the world–earning thousands of dollars for the photographers and news agencies. Politicians then used little Alan’s death to push particular political agendas–Hillary Clinton even talked about Alan Kurdi’s death in one of the 2016 election debates. Alan Kurdi’s death was extremely profitable for everyone making money off the photograph, but no one got offended over this.

Why is it acceptable for photographers and media agencies to make money off a three-year-old boy who drowned because his father was a negligent fuck who didn’t put a life vest on him*, but not acceptable for Logan Paul to make money off a guy who chose to kill himself and then leave his body hanging in public where any random person could find it?

Elian Gonzalez, sobbing, torn at gunpoint from his relatives. BTW, this photo won the 2001 Pulitzer Prize for Breaking News Photography.

Let’s take a more explicitly political case. Remember when Bill Clinton and Janet Reno sent 130 heavily armed INS agents to the home of child refugee Elian Gonzalez’s relatives** so they could kick him out of the US and send him back to Cuba?

Now imagine Donald Trump sending SWAT teams after sobbing children. How would people react?

The outrage machine functions because people think it is good. It convinces people that it is casting light on terrible problems that need correcting. People are getting offended at things that they wouldn’t have if the outrage machine hadn’t told them to. You think you are serving justice. In reality, you are mad at a man for filming a dead guy and running around Japan with a fish. Jackass did worse, and it was on MTV for two years. Game Theorist wants more consequences for people like Logan Paul, but he doesn’t realize that anyone can get offended at just about anything. His videos have graphic descriptions of small children being murdered (in videogame contexts, like Five Nights at Freddy’s or “What would happen if the babies in Mario Kart were involved in real car crashes at racing speeds?”). I don’t find this “family friendly.” Sometimes I (*gasp*) turn off his videos as a result. Does that mean I want a Twitter mob to come destroy his livelihood? No. It means a Twitter mob could destroy his livelihood.

For that matter, as Game Theorist himself notes, the algorithm itself rewards and amplifies outrage–meaning that people are incentivised to create completely false outrage against innocent people. Punishing one group of people more because the algorithm encourages bad behavior in other people is cruel and does not solve the problem. Changing the algorithm would solve the problem, but the algorithm is what makes YouTube money.

In reality, the outrage machine is pulling the country apart–and I don’t know about you, but I live here. My stuff is here; my loved ones are here.

The outrage machine must stop.

*I remember once riding in an airplane with my father. As the flight crew explained that in the case of a sudden loss of cabin pressure, you should secure your own mask before assisting your neighbors, his response was a very vocal “Hell no, I’m saving my kid first.” Maybe not the best idea, but the sentiment is sound.

**When the boat Elian Gonzalez and his family were riding in capsized, his mother and her boyfriend put him in an inner tube, saving his life even though they drowned.