The things people say on Twitter or share on Facebook are pretty trivial individually.
When you collect the 5 billion or so items posted to Facebook, Twitter and other social networks every day into one massive database, those bits of individual drivel combine into a massive pointillist masterpiece that is already changing the way governments and corporations relate to individual humans. It allows marketers to tailor products more precisely to customer preferences (and target spam campaigns more effectively), and gives governments up-to-the-minute insight into the thinking of their constituents.
Deep analysis of social-network data has changed online marketing so quickly that senior-level marketing executives are still struggling to come to grips with their new power to analyze customers, according to the CMO Council that represents them.
Which is a shame, because the picture all those petabytes of badly spellchecked musings provide of the thoughts or preferences of actual customers is mostly wrong, according to a study published in today's issue of the journal Science.
The problem isn't with the data; the problem is with the way data is presented and analyzed, according to the article's authors, Derek Ruths of McGill University and Jurgen Pfeffer of Carnegie Mellon.
Social-media datasets often munge together all those personal revelations into a big picture without correcting for things that make a big difference in their accuracy – like the demographic differences between social network populations, the type of information usually posted on each, the number of bots and spammers pretending to be human users, and even the effect of the site design on the tone of the content posted.
Facebook, which is the single largest contributor to the social-network-data universe, has a Like button but not a Dislike button, which makes it harder to detect a negative reaction to a particular piece of data, the two argue.
Facebook, which is used by about 71 percent of Americans, skews significantly female, young and (relatively) lower income, according to a December 2013 survey by the Pew Research Center. Seventy-six percent of women polled use Facebook compared to 66 percent of men; 84 percent of those between the ages of 18 and 29 use Facebook, as do 76 percent of those with incomes under $50,000 per year.
Twitter is almost gender balanced, but twice as many African-American respondents said they tweet as either white or Latino respondents, and its user numbers skew far more heavily toward those in the 18-29-year-old age group (31 percent) than Facebook's do.
Instagram is 28 percent more female than male, but is far less skewed than Pinterest, which attracts five times more women than men.
LinkedIn is more male than female (24 percent to 19 percent) and more black than white, but skews drastically toward the middle-aged (30 to 64 years old), college-educated and upper-income (38 percent make $75,000 per year or more).
Researchers and service firms that collect, clean and sell social-media data sets often slot users into easy-to-identify groups according to age, income and other variables, which makes the data look more consistent than the users it came from, according to Ruths and Pfeffer.
Even worse are reports that apply overconfident analytics to smoothed-over data to infer things like a user's political affiliation.
Even using analysis methods that are "sound and often methodologically novel," Ruths and a co-author wrote in an earlier paper, "reported accuracies have been systematically overoptimistic due to the way in which the validation datasets have been collected."
The real accuracy levels for political affiliation are closer to 65 percent than the oft-reported 90 percent, Ruths and Pfeffer wrote.
Far from being unfixable, however, miscalculations in social-media analyses can already be corrected using methods developed to address similar problems in epidemiology, statistics and machine learning.
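The paper's authors don't spell out a single recipe here, but one long-standing technique from survey statistics gives the flavor: post-stratification, which reweights each demographic group in a sample to match its share of the real population. The Python sketch below is only an illustration under stated assumptions. The function names, the two age groups and every number in it are hypothetical, not figures from the study or from Pew.

```python
from collections import defaultdict

def raw_mean(sample):
    """Unweighted average of the values, ignoring who posted them."""
    return sum(value for _, value in sample) / len(sample)

def poststratified_mean(sample, population_shares):
    """Average the values within each group, then weight each group's
    average by that group's share of the target population."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, value in sample:
        totals[group] += value
        counts[group] += 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample data for groups: {sorted(missing)}")
    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )

# Hypothetical sample: 1 = positive post about a product, 0 = not.
# Young users are heavily over-represented, as on many social networks.
SAMPLE = (
    [("18-29", 1)] * 6 + [("18-29", 0)] * 2   # 8 young users, 6 positive
    + [("30+", 0)] * 2                        # 2 older users, 0 positive
)

# Hypothetical shares of the population the analyst actually cares about.
POPULATION_SHARES = {"18-29": 0.25, "30+": 0.75}

if __name__ == "__main__":
    print("raw estimate:     ", raw_mean(SAMPLE))
    print("weighted estimate:", poststratified_mean(SAMPLE, POPULATION_SHARES))
```

In this toy case the raw number says 60 percent of users are positive, while the reweighted figure is under 19 percent, because the enthusiastic young users make up most of the sample but only a quarter of the assumed population. Reweighting can't repair a group that is missing from the data altogether, which is why the sketch raises an error instead of guessing.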
"The common thread in all these issues is the need for researchers to be more acutely aware of what they're actually analyzing when working with social media data," according to Ruth, who compared social-media mis-analysis to the flaw in survey methodology that produced the "Dewey Defeats Truman" headline from the 1948 Presidential election. That survey, which was done by telephone, drastically underestimated the number of Truman supporters, many of whom, in the days before telephones became ubiquitous even in rural areas, didn't have phones.
"We’re poised at a similar technological inflection point. By tackling the issues we face, we’ll be able to realize the tremendous potential for good promised by social media-based research," Ruths said in a McGill press release about the paper's publication.
Fortunately for marketers hoping to produce social-media analyses with results that won't send their companies racing off in a direction that is only close to the right one, there are already projects underway to fix social media's identity problem.
In October, Twitter announced it was giving the MIT Media Lab a $10 million grant and the promise of a real-time public feed of Twitter data to create analytical tools that would bring deeper, more accurate insights into the meaning of billions of Tweets.
The Social Media Research Group at the Queensland University of Technology in Brisbane, Australia, has actually come out with a Web-based platform with algorithms specifically designed to provide an "academically rigorous" analysis of social-network data.
The online service, which is available to a few early testers but is otherwise still under development, was announced Nov. 11.
"We want to move the analytics discussion beyond counts such as likes, favorites and retweets into prompting action based on real-time content and metrics placed in national and industry contexts," according to an announcement quoting co-developer Darryl Woodford, a research fellow at the university.