Big Data

Should we be worried about big data? Do we live in a disciplinary society by means of a digital panopticon?

The panopticon. It’s a type of institutional building and system of control designed in the 18th century by the English philosopher and social theorist Jeremy Bentham.

The design of the building allows all inmates of an institution to be observed by a single watchman without the inmates being able to tell whether or not they are being watched.

The fact that the inmates cannot know when they are being watched means they are motivated to act as if they are being watched at all times.

It’s an interesting design for a prison, right?

Michel Foucault, a French philosopher, used the panopticon as a metaphor for modern disciplinary societies. Specifically, their pervasive inclination to observe and normalize.

Instead of actual surveillance, the mere threat of surveillance is what disciplines society into behaving according to rules and norms.

People watch and are watched, reinforcing the disciplinary society.

Does your life online feel like this? Do we live in a modern disciplinary society?

Do tech companies, with the help of our data, contribute to our inclination to normalize? To become complacent?

Big Data. Data sets too large for commonly used software tools to capture, manage, and process within a tolerable time.

It encompasses unstructured, semi-structured and structured data.

I click, it’s tracked, stored in a database somewhere. A yes or a no. That’s structured data. You can standardize the elements with rows and columns, organizing how they relate to one another.

I type in a list of things I like, separated by commas. That’s semi-structured data. It doesn’t conform to a formal structure, but tags and other markers let it be parsed into hierarchies of records and fields.

I post my status on Facebook. It’s a wall of text. There is no data model. My words in that textbox aren’t organized in a pre-defined manner. That’s unstructured data.

The kind of data that is the primary focus of big data.
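The three kinds of data above can be sketched in a few lines of Python (the sample values are made up for illustration):

```python
import csv
import io
import json

# Structured: rows and columns with a fixed schema.
clicks_csv = "user_id,clicked\n42,yes\n43,no\n"
rows = list(csv.DictReader(io.StringIO(clicks_csv)))

# Semi-structured: no rigid schema, but tags/keys make it parseable.
likes_json = '{"user": 42, "likes": ["hiking", "jazz", "coffee"]}'
likes = json.loads(likes_json)

# Unstructured: free text with no data model at all.
status = "Had a great weekend hiking with friends!"

print(rows[0]["clicked"])   # structured fields are addressable by name
print(likes["likes"][1])    # semi-structured fields too, once parsed
print(len(status.split()))  # unstructured text needs processing first
```

The structured and semi-structured values can be queried directly; the status update has to be tokenized, tagged, or otherwise processed before a machine can make sense of it.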

Data scientists focus on the four V’s of big data: volume, variety, velocity, and veracity.

Volume is the quantity of data that is generated and stored. The size of this data determines the value and potential insights.

Walmart handles more than 1 million customer transactions every hour. These transactions are imported into databases, estimated to contain more than 2.5 petabytes of data.

A petabyte is a thousand terabytes.

A terabyte is a thousand gigabytes.

Walmart’s transaction database is equivalent to 167 times the information contained in all the books in the US Library of Congress.
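Those unit conversions are easy to check in code. A quick sketch, using the 2.5-petabyte figure above:

```python
TB_PER_PB = 1_000   # a petabyte is a thousand terabytes
GB_PER_TB = 1_000   # a terabyte is a thousand gigabytes

walmart_db_pb = 2.5
walmart_db_gb = walmart_db_pb * TB_PER_PB * GB_PER_TB
print(f"{walmart_db_gb:,.0f} GB")  # 2,500,000 GB
```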

Variety is the type and nature of the data. It helps the people analyzing the data generate useful insights.

Big data draws all types of data: text, images, audio, video, you name it.

It completes missing pieces through what’s called data fusion.

In China, the IJOP (Integrated Joint Operations Platform) is used for predictive policing. The software flags people it deems potentially threatening to officials.

DNA samples are gathered through free physicals.

CCTVs are placed in areas like entertainment venues, supermarkets, schools, and homes of religious figures.

Everyone is watched. While all of this would be hard to track for humans, computers are more than capable of processing everything.

Velocity is the speed at which the data is generated and processed. Unlike small data, big data is often produced continually, in real time.

For example, the Large Hadron Collider at CERN in Geneva is packed with 150 million sensors to capture the tiniest of particles.

These sensors deliver data 40 million times per second, and there are nearly 600 million collisions per second.

After filtering those streams, only about 100 collisions of interest per second remain.
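The scale of that filtering, in rough numbers:

```python
# Rough arithmetic on the LHC figures above.
collisions_per_second = 600_000_000  # collisions produced per second
kept_per_second = 100                # collisions of interest after filtering

print(f"1 in {collisions_per_second // kept_per_second:,} collisions kept")
# 1 in 6,000,000 collisions kept
```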

Veracity is the quality and value of the data.

Poor data quality results in inaccurate analysis.

It took 10 years to decode the human genome.

That same process can be done now in less than a day.

Google’s DNAStack compiles and organizes genetic data from DNA samples around the world to identify diseases and other medical defects.

It lets scientists run, almost instantly, experiments at a scale that would usually take years.

Chris Anderson, the former Editor of Wired Magazine, advocates the idea of using math first and establishing a context later.

He believes that the scientific method is becoming obsolete. Google conquered advertising through math. It translates languages without knowing them.

Alongside the IJOP, China runs a social credit system.

While some citizens are rewarded, many are punished.

Six million Chinese citizens have been blacklisted from flying as a consequence of social misdeeds.

We have data collected about us all the time without our knowledge. Cambridge Analytica showed that during the 2016 US Presidential Election.

And yet, 20 percent of Americans aren’t online.

The more excited we get about big data, the more decisions we will base on web generated content.

We need to remind ourselves of that 20 percent without a presence.

They are being ignored. The majority of that 20 percent aren’t privileged people opting out of a digital ecosystem.

These are poor people who can’t afford a monthly subscription.

They are people who don’t have access to reliable internet connections in rural areas.

They are the people that have been left out of the rapid advances and privileges of technology.

Ignoring them only exacerbates existing inequalities.

But if you’re listening to this, you most likely are part of the 80 percent that do have access.

You have data collected from you all the time.

Shouldn’t you be aware of the purpose of that data collection?

How will that data be used?

Who will be able to mine the data and use it?

What’s the status of security surrounding access to the data?

How can that collected data be updated?

And what about your privacy, do you value that?

The Privacy Paradox is a phenomenon where online users state that they are concerned about their privacy but then behave as if they are not.

We all underestimate the harm of disclosing information online.

The main explanation is that we lack awareness of the risks and the degree of protection we have.

Some researchers argue that the paradox comes from a lack of technology literacy as well as the design of the sites.

For example, in 2006, AOL mistakenly posted 20 million search queries by 658,000 users over 3 months.

The New York Times combed through some of the search queries, which were seemingly anonymous.

They discovered user 4417749 and identified her as Thelma Arnold, a 62-year-old woman living in Georgia.

You’re not that anonymous.

While research protocols dictate that no personally identifiable information should exist in the dataset, it’s hard to actually anonymize data.

In 2009, researchers at Carnegie Mellon University published a study that showed it is possible to predict most and sometimes all of an individual’s 9-digit Social Security number using information gleaned from social networks and online databases.

There have been cases of users having photographs stolen from social networking sites to assist in identity theft.

Preteens and early teenagers are particularly susceptible to social pressures that encourage revealing personal data when posting online.

The internet is a hunting ground for predators.

A number of highly publicized cases have demonstrated that threat.

Peter Chapman, under a false name, added over 3,000 friends on Facebook. He then went on to rape and murder Ashleigh Hall, a 17-year-old girl.

Facebook? They directly responded to the killing.

They said they warned under-18 users not to meet people from the internet.

It also gave advice on how to be safe online.

Supposedly, they were also deeply saddened.

63% of Facebook profiles are visible to the public, meaning if you Google someone’s name and add +Facebook you will probably see most of that person’s profile.

The FBI has dedicated undercover agents on Facebook, Twitter, MySpace and even LinkedIn.

In 2017, the Department of Homeland Security began using social media platforms to screen immigrants arriving in the U.S.

It’s now common for law enforcement to go undercover on social networks.

As of 2008, CareerBuilder.com estimated that one in five employers search social networking sites to screen potential candidates.

41% of managers considered information relating to candidates’ alcohol and drug use to be a top concern. But in certain jurisdictions, it’s illegal to screen candidates this way.

One last story about an angry father.

The father goes to Target and shows the manager a coupon.

He tells the manager that it was sent to his teenage daughter. The ad is for baby clothes and cribs. Why would Target be sending his teenage daughter ads meant for a pregnant woman?

The father returns home and talks to his daughter.

She’s pregnant, due in August.

Target assigns every customer a guest ID.

It’s tied to their credit card, name, and email, along with everything they’ve bought and demographic information.

Using that data, a Target statistician noticed women on the baby registry buying unscented lotion around the second trimester of their pregnancy.

Analyzing about 25 products together allowed Target to assign shoppers a “pregnancy prediction score.”
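Target’s actual model isn’t public, but the basic idea of a purchase-based score can be sketched like this. The products and weights below are entirely hypothetical, invented for illustration:

```python
# Hypothetical weights: how strongly each purchase hints at pregnancy.
# These values are made up; Target's real model has never been published.
PRODUCT_WEIGHTS = {
    "unscented lotion": 0.3,
    "prenatal vitamins": 0.9,
    "large tote bag": 0.1,
    "cotton balls": 0.2,
}

def pregnancy_score(purchases):
    """Sum the weights of a shopper's purchases, capped at 1.0."""
    score = sum(PRODUCT_WEIGHTS.get(item, 0.0) for item in purchases)
    return min(score, 1.0)

print(pregnancy_score(["unscented lotion", "prenatal vitamins"]))  # 1.0
print(pregnancy_score(["cotton balls"]))                           # 0.2
```

Even a toy version like this shows why the technique works: no single purchase is revealing on its own, but a basket of them together is.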

How comfortable are you with technology predicting your behavior?

2021 NERDLab