About this blog

Case Disclosed is a blog written by students, supervising attorneys, directors, alumni, and friends of the Media Freedom & Information Access Clinic.

The views expressed on this blog belong to the author(s) and do not represent the views of Yale Law School or the Media Freedom and Information Access Clinic (MFIA).


Social Media Mining: The Effects of Big Data In the Age of Social Media

April 3, 2018

“Big data” has become a buzzword in nearly every modern-day industry. Stories like Moneyball1 are praised as paradigmatic examples of the great successes that can come out of data analysis. Big data is undoubtedly a twenty-first-century phenomenon, and it generates interesting outcomes when it collides with another marvel of this century: social media. This was recently highlighted in the controversy surrounding Facebook and Cambridge Analytica, in which the latter collected data on the former’s users.2 This data was used in an effort to influence the 2016 presidential election by catering to individuals’ personal biases. However, Cambridge Analytica is not the only group using social media data to influence large populations. The use of this data has become ubiquitous among researchers, marketers, and the government.

Social media and big data have combined to create a novel field of study called social media mining, which is similar to data mining, but confined to the world of Twitter, Facebook, Instagram, and the like. Social media mining is “the process of representing, analyzing, and extracting actionable patterns from social media data.”3 In simpler terms, social media mining occurs when a company or organization collects data about social media users and analyzes it in an effort to draw conclusions about the populations of these users. The results are often used for targeted marketing campaigns for specific market segments.

A 2017 study published in the Journal of Advertising used social media mining techniques to gauge users’ perception of a variety of common brand names.4 The study looked specifically at Twitter, examining tweets about four different brands in each of five industries: fast-food restaurants, department stores, telecommunication carriers, consumer electronics products, and footwear companies. The researchers used a tool called the Twitter Streaming Application Programming Interface (API). This tool, which is provided by Twitter, allows users to pull tweets from Twitter according to certain keywords. In this case, the researchers used the Twitter handles of each company (“@CompanyName”) as keywords to pull about ten million tweets about each of the twenty companies studied over a six-month period in 2015. They then used algorithms to sift through the tweets, compile them, and boil each one down to a general topic and sentiment. The results were highly specific. For example, the study found that 15.7% of tweets about fast-food restaurants were about promotions the chains were offering5 and that 66.7% of tweets about Comcast contained a negative sentiment.6
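To make the idea of boiling tweets down to a sentiment more concrete, here is a rough Python sketch. It is not the study's method: the handles, the word lists, and the sample tweets are all made up for illustration, and simple word counting stands in for whatever algorithms the researchers actually used. It only shows how a figure like "X% of tweets about a brand were negative" can fall out of keyword matching plus a per-tweet label.

```python
from collections import Counter, defaultdict

# Hypothetical handles and word lists, chosen only for illustration.
HANDLES = ["@comcast", "@nike"]
NEGATIVE_WORDS = {"terrible", "outage", "worst", "slow"}
POSITIVE_WORDS = {"love", "great", "thanks", "fast"}


def tokenize(text):
    """Lowercase a tweet and strip basic punctuation from each token."""
    return [tok.strip(".,!?:;").lower() for tok in text.split()]


def mentioned_handles(text, handles=HANDLES):
    """Return the tracked handles that appear as tokens in a tweet."""
    tokens = set(tokenize(text))
    return [h for h in handles if h in tokens]


def sentiment(text):
    """Label a tweet by counting words from the two toy word lists."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE_WORDS for t in tokens)
    score -= sum(t in NEGATIVE_WORDS for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def sentiment_shares(tweets):
    """Map each handle to the percentage of its tweets in each sentiment class."""
    counts = defaultdict(Counter)
    for text in tweets:
        label = sentiment(text)
        for handle in mentioned_handles(text):
            counts[handle][label] += 1
    return {
        handle: {label: 100 * n / sum(c.values()) for label, n in c.items()}
        for handle, c in counts.items()
    }


# Made-up sample tweets, not data from the study.
SAMPLE_TWEETS = [
    "@Comcast another outage, worst service ever",
    "@comcast thanks for the fast fix!",
    "@Comcast is slow again",
    "@Nike love the new shoes",
]

shares = sentiment_shares(SAMPLE_TWEETS)
print(shares)
```

A real pipeline would replace the word lists with a trained classifier and would read tweets from a stream rather than a list, but the final aggregation step would look much the same.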

Many people might find it shocking to know that companies are trawling social media pages in search of information they can use for marketing. However, studies like the one found in the Journal of Advertising are just the tip of the iceberg. Other uses of these data mining tools are even more invasive.

A study published in October 2017 sought to determine how to make best use of digital out-of-home (DOOH) advertisements in the London Underground.7 An example of a DOOH ad would be a digital billboard programmed to change the advertisement on display after a specific period of time. To achieve their goal, the researchers used the same Twitter Streaming API described in the previous study; however, this time they relied on Twitter’s geotagging function (a capability that allows Twitter users to “tag” their location when they post a tweet). Each London Underground station was carefully outlined on a map of London. Then, the researchers randomly sampled geotagged tweets falling within those zones (meaning the tweeter was at a station). The specific Underground station, the time of the tweet, and the content of the tweet were all extracted. The researchers continued this practice for one year, apparently without the knowledge of the Twitter-using patrons of the London Underground, collecting over 10.5 million tweets. This data was then compiled and processed to determine what topics people were tweeting about in each London Underground station at certain times of the day on weekdays and on weekends. For example, nearly 35% of tweets from the Holloway Road station were about sports, and almost 40% of tweets posted between 6 PM and midnight on weekends at the North Greenwich station were about music.8 The authors of the study recommended using this data to create targeted DOOH advertising. For instance, a music-related ad on a rotating digital billboard at night on the weekends in North Greenwich station would probably be more successful than an ad for a sports team.
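The station-and-time breakdown described above can be sketched in a few lines of Python. This is a simplified illustration rather than the researchers' actual procedure. It assumes each station zone is a plain rectangle (the study outlined stations on a map, and the coordinates below are rough, hypothetical values). It also assumes each tweet already carries a topic label and that six-hour time slots are a reasonable grouping.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical station zones as (min_lat, min_lon, max_lat, max_lon) boxes.
# The coordinates are rough illustrative values, not the study's outlines.
STATION_ZONES = {
    "North Greenwich": (51.4995, 0.0020, 51.5015, 0.0050),
    "Holloway Road": (51.5520, -0.1140, 51.5535, -0.1120),
}


def station_for(lat, lon, zones=STATION_ZONES):
    """Return the station whose box contains the point, or None."""
    for name, (min_lat, min_lon, max_lat, max_lon) in zones.items():
        if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            return name
    return None


def time_slot(ts):
    """Return (day type, six-hour slot) for a timestamp, e.g. ("weekend", "18-24")."""
    day_type = "weekend" if ts.weekday() >= 5 else "weekday"
    start = (ts.hour // 6) * 6
    return day_type, f"{start:02d}-{start + 6:02d}"


def topic_shares(tweets):
    """Percentage of tweets per topic for each (station, day type, slot)."""
    counts = defaultdict(Counter)
    for lat, lon, ts, topic in tweets:
        station = station_for(lat, lon)
        if station is None:
            continue  # tweet was not posted inside any station zone
        counts[(station, *time_slot(ts))][topic] += 1
    return {
        key: {topic: 100 * n / sum(c.values()) for topic, n in c.items()}
        for key, c in counts.items()
    }


# Made-up geotagged tweets: (latitude, longitude, timestamp, topic label).
SAMPLE_TWEETS = [
    (51.5005, 0.0035, datetime(2017, 3, 4, 20, 30), "music"),
    (51.5006, 0.0036, datetime(2017, 3, 4, 22, 5), "music"),
    (51.5004, 0.0034, datetime(2017, 3, 5, 19, 0), "sports"),
    (51.5527, -0.1130, datetime(2017, 3, 6, 8, 15), "sports"),
    (51.5074, -0.1278, datetime(2017, 3, 6, 9, 0), "news"),
]

station_shares = topic_shares(SAMPLE_TWEETS)
print(station_shares)
```

An advertiser could then look up a key such as ("North Greenwich", "weekend", "18-24") and schedule whichever ad category has the largest share for that station and time slot.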

Social media mining has profound legal and ethical implications, many of which are still developing. Privacy considerations are at the center of the debate on this tool. Regulation of the use of social media data is important to protect freedom of expression among users of social media. If users feel that their social media activity can be used freely by third-party corporations, they will likely feel guarded in their future use of these platforms or will cease using them altogether. To remedy these privacy concerns, platforms have policies in place that regulate what information third-party companies can access and how they may use that information.9 Furthermore, third-party companies that use social data often have their own policies about how they will use it. Use of social media data in conflict with these policies can land companies in legal trouble. Cambridge Analytica’s recent data breach is a prime example.10 Its data mining practices were in conflict with Facebook’s policies. However, upon learning of the breach, Facebook failed to take significant legal action, leading to the current scandal.

Since the advent of social media, the mining of the data we voluntarily offer to these sites has become prevalent. Big data in this form is used to target users and control what content they see. However, this doesn’t end with the advertising of products and services. Cambridge Analytica mined over fifty million Facebook profiles.11 This data was not used to market products to Facebook users, but instead to market political ideologies.  This has raised serious questions about the influence this practice had on both the 2016 election of Donald Trump and the 2016 Brexit vote in the UK. In facing the realities of these events, we are forced to consider whether anything we post on social media can ever actually be private—and how the law needs to evolve to meet these concerns.

1.  MICHAEL LEWIS, MONEYBALL: THE ART OF WINNING AN UNFAIR GAME (2004) (recounting the Oakland Athletics general manager Billy Beane’s use of data and statistics to recruit unconventional baseball players and land the underdog team a spot in the playoffs).

2.  Carole Cadwalladr & Emma Graham-Harrison, Revealed: 50 Million Facebook Profiles Harvested for Cambridge Analytica in Major Data Breach, GUARDIAN (Mar. 17, 2018), https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-faceboo....


4.  Xia Liu et al., An Investigation of Brand-Related User-Generated Content on Twitter, 46 J. ADVERT. 236 (2017).

5.  Id. at 241.

6.  Id. at 242.

7.  Juntao Lai et al., Improved Targeted Outdoor Advertising Based on Geotagged Social Media Data, 23 ANNALS GIS 237 (2017).

8.  Id. at 248.

9.  Judy Selby et al., Best Practices in Collecting and Using Social Data, BIG LAW BUSINESS (2015).

10.  Cadwalladr & Graham-Harrison, supra note 2.

11.  Id.