A Little Review of Big Data Books

I recently finished three books on “big data”: Big Data: A Revolution That Will Transform How We Live, Work, and Think, by Viktor Mayer-Schönberger and Kenneth Cukier; Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are, by Seth Stephens-Davidowitz; and Big Data at Work: Dispelling the Myths, Uncovering the Opportunities, by Thomas H. Davenport.

None of these books was a whiz-bang thriller, but I enjoyed them.

Big Data was a very sensible introduction. What exactly is “big data”? It’s not just bigger data sets (though it is also that.) It’s the opportunity to get all the data.

Until now, the authors point out, we have lived in a data-poor world. We have had to carefully design our surveys to avoid sampling bias because we just can’t sample that many people. There’s a whole bunch of math done over in statistics to calculate how certain we can be about a particular result, or whether it could just be the result of random chance biasing our samples. I could poll 10,000 people about their jobs, and that might be a pretty good sample, but if everyone I polled happens to live within walking distance of my house, is this a very representative sample of everyone in the country? Now think about all of those studies on the mechanics of sleep done on whatever college students or homeless guys a scientist could convince to sleep in a lab for a week. How representative are they?
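To make the walking-distance problem concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the 100 neighborhoods, the income figures, and the assumption that typical income rises with distance from my house. None of it comes from the book. It compares a big poll of 10,000 neighbors against a small random poll of 500 people.

```python
import random
import statistics


def compare_polls(seed=0):
    rng = random.Random(seed)
    # Invented population: 100 neighborhoods of 1,000 people each.
    # Assumption for illustration only: typical income rises with
    # distance from "my house" (neighborhood 0 is the closest).
    population = []
    for hood in range(100):
        hood_mean = 30_000 + 600 * hood
        for _ in range(1_000):
            population.append((hood, rng.gauss(hood_mean, 10_000)))

    true_mean = statistics.fmean([income for _, income in population])

    # Big but biased poll: all 10,000 people in the 10 nearest neighborhoods.
    nearby = [income for hood, income in population if hood < 10]

    # Small but random poll: 500 people drawn from the whole population.
    random_poll = [income for _, income in rng.sample(population, 500)]

    return {
        "true_mean": true_mean,
        "nearby_error": abs(statistics.fmean(nearby) - true_mean),
        "random_error": abs(statistics.fmean(random_poll) - true_mean),
    }


if __name__ == "__main__":
    for name, value in compare_polls().items():
        print(f"{name}: {value:,.0f}")
```

Under these made-up assumptions, the 10,000-person neighborhood poll misses the true average by far more than the 500-person random poll does. Size alone doesn't rescue a sample drawn from the wrong place.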

Today, though, we suddenly live in a data-rich world. An exponentially data-rich world. A world in which we no longer need to correct for bias in our sample, because we don’t have to sample. We can just get… all the data. You can go to Google and find out how many people searched for “rabbit” on Tuesday, or how many people misspelled “rabbit” in various ways.

Data is being used in new and interesting (and sometimes creepy) ways. Many things that previously weren’t even considered data are now being quantified–like one researcher quantifying people’s backsides to determine whether a car is being driven by its owner or by a stranger.

One application I find promising is using people’s searches for various disease symptoms to identify people who may have various diseases before they seek out a doctor. Catching cancer patients earlier could save millions of lives.

I don’t have the book in front of me anymore, so I am just going by memory, but it made a good companion to Auerswald’s The Code Economy, since the modern economy runs so much on data.

Everybody Lies was a much more lighthearted, anecdotal approach to the subject, discussing lots of different studies. Stephens-Davidowitz was inspired by Freakonomics, and he wants to use Big Data to uncover hidden truths of human behavior.

The book discusses, for example, people’s pornographic searches (as per the title, people routinely lie about how much porn they look at on the internet), and whether people’s pornographic preferences can be used to determine what percent of people in each state are gay. It turns out that we can get a breakdown of porn queries by state and variety, allowing a rough estimate of the gay and straight population of each state–and it appears that what people are willing to tell pollsters about their sexuality doesn’t match what they search for online. In more conservative states, people are less likely to admit to pollsters that they are gay, but plenty of supposedly “straight” people are searching for gay porn–about the same number of people as actually admit to being gay in more liberal states.

Stephens-Davidowitz uses similar data to determine that people have been lying to pollsters (or perhaps themselves) about whom they plan to vote for. For example, Donald Trump got anomalously high votes in some areas, and Obama got anomalously low votes, compared to what people in those areas told pollsters. However, both of these areas correlated highly with areas of the country where people made a lot of racist Google searches.

Most of the studies discussed are amusing, like the discovery of the racehorse American Pharoah. Others are quite important, like a study that found that child abuse was probably actually going up at a time when official reports said it wasn’t–the reports probably weren’t showing abuse due to a decrease in funding for investigating abuse.

At times the author steps beyond the studies and offers interpretations of why the results are the way they are that I think go beyond what the data tells, like his conclusion that parents are biased against their daughters because they are more concerned with girls being fat than with boys, or because they are more likely to Google “is my son a genius?” than “is my daughter a genius?”

I can think of a variety of alternative explanations. E.g., society itself is crueler to overweight women than to overweight men, so it is reasonable, in turn, for parents to worry more about a daughter who will face cruelty than a son who will not. Girls are more likely to be in gifted programs than boys, but perhaps this means that giftedness in girls is simply less exceptional than giftedness in boys, who are more unusual. Or perhaps male giftedness is different from female giftedness in some way that makes parents need more information on the topic.

Now, here’s an interesting study. Google can track how many people make Islamophobic searches at any particular time. Compared against Obama’s speech that tried to calm outrage after the San Bernardino attack, this data reveals that the speech was massively unsuccessful. Islamophobic searches doubled during and after the speech. Negative searches about Syrian refugees rose 60%, while searches asking how to help dropped 35%.

In fact, just about every negative search we could think to test regarding Muslims shot up during and after Obama’s speech, and just about every positive search we could think to test declined. …

Instead of calming the angry mob, as everybody thought he was doing, the internet data tells us that Obama actually inflamed it.

However, Obama later gave another speech, on the same topic. This one was much more successful. As the author put it, this time, Obama spent little time insisting on the value of tolerance, which seems to have just made people less tolerant. Instead, “he focused overwhelmingly on provoking people’s curiosity and changing their perceptions of Muslim Americans.”

People tend to react positively toward people or things they regard as interesting, and invoking curiosity is a good way to get people interested.

The author points out that “big data” is most likely to be useful in fields where the current data is poor. In the case of American Pharoah, for example, people just plain weren’t getting a lot of data on racehorses before buying and selling them. It was a field based on people who “knew” horses and their pedigrees, not on people who x-rayed horses to see how big their hearts and lungs were. By contrast, hedge funds investing in the stock market are already up to their necks in data, trying to maximize every last penny. Horse racing was ripe for someone to become successful by unearthing previously unused data and making good predictions; the stock market is not.

And for those keeping track of how many people make it to the end of the book, I did. I even read the endnotes, because I do that.

Big Data At Work was very different. Rather than entertain us with the success of Google Flu or academic studies of human nature, BDAW discusses how to implement “big data” (the author admits it is a silly term) strategies at work. This is a good book if you own, run, or manage a business that could utilize data in some way. UPS, for example, uses driving data to minimize package delivery routes; even a small saving per package by optimizing routes leads to a large saving for the company as a whole, since they deliver so many packages.

The author points out that “big data” often isn’t big so much as unstructured. Photographs, call logs, Facebook posts, and Google searches may all be “data,” but you will need some way to quantify these before you can make much use of them. For example, companies may want to gather customer feedback reports, feed them into a program that recognizes positive or negative language, and then count how many people called to report that they liked Product X vs. how many called to report that they disliked it.
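Here is a rough sketch of what that feedback-counting program might look like, in Python. The word lists and sample reports are hypothetical, and matching keywords is only a stand-in for whatever language model a real company would use; the point is just to show unstructured text being turned into numbers.

```python
import re
from collections import Counter

# Hypothetical word lists; a real system would use a trained model instead.
POSITIVE = {"love", "like", "liked", "great", "excellent", "happy"}
NEGATIVE = {"hate", "dislike", "disliked", "broken", "terrible", "refund"}


def classify(report):
    """Label one feedback report by counting positive and negative words."""
    words = re.findall(r"[a-z']+", report.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unclear"


def tally(reports):
    """Count how many reports fall under each label."""
    return Counter(classify(report) for report in reports)


# Made-up sample reports for illustration.
reports = [
    "I love Product X, it works great",
    "Product X arrived broken and I want a refund",
    "Calling to ask about shipping times",
    "My kids liked it",
]

if __name__ == "__main__":
    print(tally(reports))
```

A keyword counter like this is easily fooled (“I don’t like it” counts as positive), which is part of why unstructured data is harder to use than it first looks.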

I think an area ripe for this kind of quantification is medical data, which currently languishes in doctors’ files, much of it on paper, protected by patient privacy laws. But people post a good deal of information about their medical conditions online, seeking help from other people who’ve dealt with the same diseases. Currently, there are a lot of diseases (take depression) where treatment is very hit-or-miss, and doctors basically have to try a bunch of drugs in a row until they find one that works. A program that could trawl through forum posts and assemble data on patients and medical treatments that worked or failed could help doctors refine treatment for various difficult conditions–“Oh, you look like the kind of patient who would respond well to melatonin,” or “Oh, you have the characteristics that make you a good candidate for Prozac.”

The author points out that most companies will not be able to keep the massive quantities of data they are amassing. A hospital, for example, collects a great deal of data about patients’ heart rates and blood oxygen levels every day. While it might be interesting to look back at 10 years’ worth of patient heart rate data, hospitals can’t really afford to invest in databanks to store all of this information. Rather, what companies need is real-time or continuous data processing that analyzes current data and makes predictions/recommendations for what the company (or doctor) should do now.

For example, one of the books (I believe it was “Big Data”) discussed a study of premature babies which found, counter-intuitively, that they were most likely to have emergencies soon after a lull in which they had seemed to be doing rather well–stable heart rate, good breathing, etc. Knowing this, a hospital could have a computer monitoring all of its premature babies and automatically updating their status (“stable” “improving” “critical” “likely to have a big problem in six hours”) and notifying doctors of potential problems.
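As a toy illustration of that kind of continuous monitoring, here is a minimal Python sketch. It keeps only the last few readings rather than years of history, and flags a window where the heart rate has become unusually steady. The window size, the threshold, the status labels, and the readings are all numbers I made up; none of this is clinical guidance or the method from the study.

```python
import statistics
from collections import deque


class VitalsMonitor:
    """Keeps only the most recent readings and labels the current window."""

    def __init__(self, window=6, lull_stdev=1.0):
        # Hypothetical settings: 6 readings per window, and a spread below
        # 1.0 beats per minute counts as a suspiciously calm "lull".
        self.readings = deque(maxlen=window)
        self.lull_stdev = lull_stdev

    def update(self, heart_rate):
        """Add one reading and return a status for the current window."""
        self.readings.append(heart_rate)
        if len(self.readings) < self.readings.maxlen:
            return "collecting"
        spread = statistics.pstdev(self.readings)
        if spread < self.lull_stdev:
            return "watch closely"
        return "no flag"


# Made-up stream of heart rates that settles into a lull at the end.
stream = [150, 158, 146, 155, 149, 157, 152, 152, 153, 152, 152, 153]

if __name__ == "__main__":
    monitor = VitalsMonitor()
    for rate in stream:
        print(rate, monitor.update(rate))
```

Because the monitor holds only one window of readings at a time, it never needs the ten-year databank; older readings simply fall off the end.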

The book goes into a fair amount of detail about how to implement “big data solutions” at your office (you may have to hire someone who knows how to code and may even have to tolerate their idiosyncrasies), which platforms are useful for data, the fact that “big data” is not all that different from standard analytics that most companies already run, etc. Once you’ve got the data pumping, actual humans may not need to be involved with it very often–for example, you may have a system that automatically updates drivers’ routes with traffic reports, or sprinklers that automatically turn on when the ground gets too dry.

It is easy to see how “big data” will become yet another facet of the algorithmization of work.

Overall, Big Data at Work is a good book, especially if you run a company, but not as amusing if you are just a lay reader. If you want something fun, read the first two.

Book Club: The Code Economy, Chapter 11: Education and Death

Welcome back to EvX’s book club. Today we’re reading Chapter 11 of The Code Economy, Education.

…since the 1970s, the economically fortunate among us have been those who made the “go to college” choice. This group has seen its income grow rapidly and its share of the aggregate wealth increase sharply. Those without a college education have watched their income stagnate and their share of the aggregate wealth decline. …

Middle-age white men without a college degree have been beset by sharply rising death rates–a phenomenon that contrasts starkly with middle-age Latino and African American men, and with trends in nearly every other country in the world.

It turns out that I have a lot of graphs on this subject. There’s a strong correlation between “white death” and “Trump support.”

[Graph: white vs. non-white Americans]

[Graph: American whites vs. other first-world nations]

But “white men” doesn’t tell the complete story, as death rates for women have been increasing at about the same rate. The Great White Death seems to be as much a female phenomenon as a male one–men just started out with higher death rates in the first place.

Many of these are deaths of despair–suicide, directly or through simply giving up on living. Many involve drugs or alcohol. And many are due to diseases, like cancer and diabetes, that used to hit later in life.

We might at first think the change is just an artifact of more people going to college–perhaps there was always a sub-set of people who died young, but in the days before most people went to college, nothing distinguished them particularly from their peers. Today, with more people going to college, perhaps the destined-to-die are disproportionately concentrated among folks who didn’t make it to college. However, if this were true, we’d expect death rates to hold steady for whites overall–and they have not.

Whatever is affecting lower-class whites, it’s real.

Auerswald then discusses the “permanent income hypothesis,” developed by Milton Friedman: As children and young adults, we devote our time to education (even going into debt), which allows us to get a better job in mid-life. When we get a job, we stop going to school and start saving for retirement. Then we retire.

The permanent income hypothesis is built into the very structure of our society, from Public Schools that serve students between the ages of 5 and 18, to Pell Grants for college students, to Social Security benefits that kick in at 65. The assumption, more or less, is that a one-time investment in education early in life will pay off for the rest of one’s life–a payout of such returns to scale that it is even sensible for students and parents to take out tremendous debt to pay for that education.

But this is dependent on that education actually paying off–and that is dependent on the skills people learn during their educations being in demand and sufficient for their jobs for the next 40 years.

The system falls apart if technology advances and thus job requirements change faster than once every 40 years. We are now looking at a world where people’s investments in education can be obsolete by the time they graduate, much less by the time they retire.

Right now, people are trying to make up for the decreasing returns to education (a high school degree does not get you the same job today as it did in 1950) by investing more money and time into the single-use system–encouraging toddlers to go to school on the one end and poor students to take out more debt for college on the other.

This is probably a mistake, given the time-dependent nature of the problem.

The obvious solution is to change how we think of education and work. Instead of a single, one-time investment, education will have to continue after people begin working, probably in bursts. Companies will continually need to re-train workers in new technology and innovations. Education cannot be just a single investment, but a life-long process.

But that is hard to do if people are already in debt from all of the college they just paid for.

Auerswald then discusses some fascinating work by Bessen on how the industrial revolution affected incomes and production among textile workers:

… while a handloom weaver in 1800 required nearly forty minutes to weave a yard of coarse cloth using a single loom, a weaver in 1902 could do the same work operating eighteen Northrop looms in less than a minute, on average. This striking point relates to the relative importance of the accumulation of capital to the advance of code: “Of the roughly thirty-nine-minute reduction in labor time per yard, capital accumulation due to the changing cost of capital relative to wages accounted for just 2 percent of the reduction; invention accounted for 73 percent of the reduction; and 25 percent of the time saving came from greater skill and effort of the weavers.” … “the role of capital accumulation was minimal, counter to the conventional wisdom.”
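To put those percentages in minutes, here is a quick back-of-the-envelope calculation in Python. The only inputs are the figures quoted above (a roughly 39-minute reduction, split 2/73/25); the variable and function names are my own.

```python
# Figures from the passage quoted above; names are mine.
REDUCTION_MINUTES = 39  # "roughly thirty-nine-minute reduction" per yard
SHARES = {
    "capital accumulation": 0.02,
    "invention": 0.73,
    "skill and effort": 0.25,
}


def minutes_saved(total_minutes, shares):
    """Split a total time saving according to each factor's share."""
    return {name: round(total_minutes * share, 2) for name, share in shares.items()}


if __name__ == "__main__":
    for name, minutes in minutes_saved(REDUCTION_MINUTES, SHARES).items():
        print(f"{name}: about {minutes} minutes per yard")
```

That works out to well under one minute from capital accumulation, about 28 and a half minutes from invention, and just under 10 minutes from the weavers’ own skill and effort.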

Then Auerswald proclaims:

What was the role of formal education in this process? Essentially nonexistent.

Boom.

New technologies are simply too new for anyone to learn about them in school. Flexible thinkers who learn fast (generalists) thus benefit from new technologies and are crucial for their early development. Once a technology matures, however, it becomes codified into platforms and standards that can be taught, at which point demand for generalists declines and demand for workers with educational training in the specific field rises.

For Bessen, formal education and basic research are not the keys to the development of economies that they are often represented as being. What drives the development of economies is learning by doing and the advance of code–processes that are driven at least as much by non-expert tinkering as by formal research and instruction.

Make sure to read the endnotes to this chapter; several of them are very interesting. For example, #3 begins:

“Typically, new technologies demand that a large number of variables be properly controlled. Henry Bessemer’s simple principle of refining molten iron with a blast of oxygen works properly only at the right temperatures, in the right size vessel, with the right sort of vessel refractory lining, the right volume and temperature of air, and the right ores…” Furthermore, the products of these factories were really ones that, in the United States, previously had been created at home, not by craftsmen…

#8 states:

“Early-stage technologies–those with relatively little standardized knowledge–tend to be used at a smaller scale; activity is localized; personal training and direct knowledge sharing are important, and labor markets do not compensate workers for their new skills. Mature technologies–with greater standardized knowledge–operate at large scale and globally, market permitting; formalized training and knowledge are more common; and robust labor markets encourage workers to develop their own skills.” … The intensity of interactions that occur in cities is also important in this phase: “During the early stages, when formalized instruction is limited, person-to-person exchange is especially important for spreading knowledge.”

This reminds me of a post on Bruce Charlton’s blog about “Head Girl Syndrome”:

The ideal Head Girl is an all-rounder: performs extremely well in all school subjects and has a very high Grade Point Average. She is excellent at sports, Captaining all the major teams. She is also pretty, popular, sociable and well-behaved.

The Head Girl will probably be a big success in life…

But the Head Girl is not, cannot be, a creative genius.

*

Modern society is run by Head Girls, of both sexes, hence there is no place for the creative genius.

Modern Colleges aim at recruiting Head Girls, so do universities, so does science, so do the arts, so does the mass media, so does the legal profession, so does medicine, so does the military…

And in doing so, they filter-out and exclude creative genius.

Creative geniuses invent new technologies; head girls oversee the implementation and running of code. Systems that run on code can run very smoothly and do many things well, but they are bad at handling creative geniuses, as many a genius will inform you of their public school experience.

How different stages in the adoption of new technology and its codification into platforms translate into wages over time is a subject I’d like to see more of.

Auerswald then turns to the perennial problem of what happens when not only do the jobs change, they entirely disappear due to increasing robotification:

Indeed, many of the frontier business models shaping the economy today are based on enabling a sharp reduction in the number of people required to perform existing tasks.

One possibility Auerswald envisions is a kind of return to the personalized markets of yesteryear, before massive industrial giants like Walmart sprang up. Via internet-based platforms like Uber or Airbnb, individuals can connect directly with people who’d like to buy their goods or services.

Since services make up more than 84% of the US economy and an increasingly comparable percentage in countries elsewhere, this is a big deal.

It’s easy to imagine this future in which we are all like some sort of digital Amish, continually networked via our phones to engage in small transactions like sewing a pair of trousers for a neighbor, mowing a lawn, selling a few dozen tacos, or driving people to the airport during a few spare hours on a Friday afternoon. It’s also easy to imagine how Walmart might still have massive economies of scale over individuals and the whole system might fail miserably.

However, if we take the entrepreneurial perspective, such enterprises are intriguing. Uber and Airbnb work by essentially “unlocking” latent assets–time when people’s cars or homes were sitting around unused. Anyone who can find other, similar latent assets and figure out how to unlock them could become similarly successful.

I’ve got an underutilized asset: rural poor. People in cities are easy to hire and easy to direct toward educational opportunities. Kids growing up in rural areas are often out of the communications loop (the internet doesn’t work terribly well in many rural areas) and have to drive a long way to job interviews.

In general, it’s tough to network large rural areas in the same ways that cities get networked.

On the matter of why peer-to-peer networks have emerged in certain industries, Auerswald makes a claim that I feel compelled to contradict:

The peer-to-peer business models in local transportation, hospitality, food service, and the rental of consumer goods were the first to emerge, not because they are the most important for the economy but because these are industries with relatively low regulatory complexity.

No no no!

Food trucks emerged because heavy regulations on restaurants (e.g., fire code, disability access, landscaping) have cut significantly into profits for restaurants housed in actual buildings.

Uber emerged because the cost of a cab medallion–that is, a license to drive a cab–hit 1.3 MILLION DOLLARS in NYC. It’s a lucrative industry that people were being kept out of.

In contrast, there has been little peer-to-peer business innovation in healthcare, energy, and education–three industries that comprise more than a quarter of the US GDP–where regulatory complexity is relatively high.

Again, no.

There is a ton of competition in healthcare; just look up naturopaths and chiropractors. Sure, most of them are quacks, but they’re definitely out there, competing with regular doctors for patients. (Midwives appear to be actually pretty effective at what they do and significantly cheaper than standard ob-gyns.)

The difficulty with peer-to-peer healthcare isn’t regulation but knowledge and equipment. Most Americans own a car and know how to drive, and therefore can join Uber. Most Americans do not know how to do heart surgery and do not have the proper equipment to do it with. With training I might be able to set a bone, but I don’t own an x-ray machine. And you definitely don’t want me manufacturing my own medications. I’m not even good at making soup.

Education has tons of peer-to-peer innovation. I homeschool my children. Sometimes grandma and grandpa teach the children. Many homeschoolers join consortia that offer group classes, often taught by a knowledgeable parent or hired tutor. Even people who aren’t homeschooling their kids often hire tutors, through organizations like Wyzant or afterschool test-prep centers like Kumon. One of my acquaintances makes her living primarily by Skype-tutoring Koreans in English.

And that’s not even counting private schools.

Yes, if you want to set up a formal “school,” you will encounter a lot of regulation. But if you just want to teach stuff, there’s nothing stopping you except your ability to find students who’ll pay you to learn it.

Now, energy is interesting. Here Auerswald might be correct. I have trouble imagining people setting up their own hydroelectric dams without getting into trouble with the EPA (not to mention everyone downstream).

But what if I set up my own windmill in my backyard? Can I connect it to the electric grid and sell energy to my neighbors on a windy day? A quick search brings up WindExchange, which says, very directly:

Owners of wind turbines interconnected directly to the transmission or distribution grid, or that produce more power than the host consumes, can sell wind power as well as other generation attributes.

So, maybe you can’t set up your own nuclear reactor, and maybe the EPA has a thing about not disturbing fish, but it looks like you can sell wind and solar energy back to the grid.

I find this a rather exciting thought.

Ultimately, while Auerswald does return to and address the need to radically change how we think about education and the education-job-retirement lifepath, he doesn’t return to the increasing white death rate. Why are white death rates increasing faster than other death rates, and will the transition to the “gig economy” further accelerate this trend? Or was the past simply anomalous for having low white death rates, or could these death rates be driven by something independent of the economy itself?

Now, it’s getting late, so that’s enough for tonight, but what are your thoughts? How do you think this new economy–and educational landscape–will play out?