Excerpts from "Big Data"

I've collected interesting figures, facts, stories, and quotes from the book "Big Data: A Revolution That Will Transform How We Live, Work, and Think" by Viktor Mayer-Schönberger and Kenneth Cukier.
In 1986, 40% of the world's computing power was in pocket calculators.
Walmart and CapitalOne were the first to use big data in retail and banking.
Algorithm performance improved roughly 43,000-fold between 1988 and 2003, far more than processor performance gained under Moore's Law over the same period.
Simple models trained on lots of data outperform more complex models trained on less data.
Using Hadoop, Visa managed to reduce the processing time of test records accumulated over two years (73 billion transactions) from one month to 13 minutes.
PriceStats scans up to 5,000,000 prices on goods from 300 retailers in 70 countries to detect inflationary fluctuations in real time and sells the results to banks and investment funds.
In collaboration with Teradata, Walmart identified interesting correlations — before a hurricane, sales of flashlights and Pop-Tarts, as well as sweet breakfast cereals, increased.
Insurance company Aviva analyzes data on hobbies, websites visited, and time spent watching TV to identify individuals at risk of developing high blood pressure, diabetes, or depression.
Around the third month of pregnancy, women commonly bought unscented lotion. A few months later, they typically bought dietary supplements.
IBM and Microsoft are collaborating with hospitals to develop software that receives and processes data on patients' condition in real time. The data is used to support diagnostic decisions. The system tracks 16 different data streams, such as heart rate, respiratory rate, temperature, blood pressure, and blood oxygen levels, totaling about 1,260 data points per second.
At income levels below $10,000, each increase in income brought greater happiness, but income growth above this level changed little. The focus should therefore be on raising the incomes of the poor since, as the data showed, this would yield the greatest return.
Cars painted orange are roughly half as likely to break down as other cars.
Since the mid-15th century, 129 million different books have been published. By 2010, five years after launching its book project, Google had scanned over 15 million titles, a significant part of the world's written heritage (over 12%). This gave rise to a new academic discipline, culturomics: a form of computational lexicology that attempts to understand human behavior and cultural trends through quantitative analysis of texts.
Google's Street View cars, which take panoramic photos of streets, also collect information about Wi-Fi network routers.
The iPhone collects location data and information about Wi-Fi networks; additionally, Google Android and Microsoft collect similar data.
AirSage processes three billion geolocation data records to create real-time road reports in 100 cities across America. Sense Networks and Skyhook use location data to report which areas of a city have the most vibrant nightlife or how many protesters have gathered at a demonstration.
Two hedge funds — Derwent Capital in London and MarketPsych in California — began analyzing datified text from tweets as signals for stock market investments.
The company Asthmapolis attaches sensors to asthma inhalers that track location using GPS. The collected information helps determine which environmental factors trigger asthma attacks.
In 2009, Apple filed a patent application for collecting data on blood oxygen saturation, heart rate, and body temperature through earbud headphones.
Luis von Ahn created Captcha. Five years later, about 200 million Captchas were being entered daily, and he began looking for ways to put that human computing power to more productive use. The result was ReCaptcha: instead of entering random letters, people now type words from text scanning projects that optical character recognition software could not recognize.
Google and the Spanish bank BBVA launched a business forecasting service that analyzes the tourism sector and sells real-time economic indicators. The Bank of England uses search queries related to real estate to refine its view of housing price trends.
Google resists calls to delete the full IP addresses of old search queries; instead, after 18 months, only the last four digits are removed to make the search query anonymous.
Nobody can yet say how models for valuing data will play out. But it is already clear that an economy is beginning to form around data.
MasterCard Advisors consolidates and analyzes 65 billion transactions made by 1.5 billion cardholders in 210 countries to forecast consumer and business trends. This information is sold to other companies. Among other things, the company found that if people filled up their cars around 4 PM, within an hour they were likely to spend $35-50 at a grocery store or restaurant.
Inrix analyzes traffic. It combines real-time geolocation data on 100 million cars in the US and Europe. Data comes from BMW, Ford, and Toyota cars, commercial taxi and delivery fleets, and individual drivers' mobile phones (it's worth noting the important role of Inrix's free smartphone apps: users get free traffic information, and Inrix gets their coordinates). Inrix combines this information with traffic pattern data, weather information, and other factors (e.g., local events) to forecast traffic density. The finished "product" is transmitted to car satellite navigation systems and used by government agencies and commercial fleets.
In 2011, the US economic recovery began to show cracks, despite politicians' claims to the contrary. Traffic analysis quickly revealed this: roads became less congested during rush hour, suggesting an increase in unemployment. Inrix also sold its data to an investment fund, which uses traffic patterns around major retail chain stores to estimate their sales volumes and trades the companies' stocks before their quarterly earnings announcements. The underlying correlation is that the more cars there are in the area around a store, the higher its sales.
Coursera, a distance learning company, analyzes all the data exhaust it collects (e.g., which section of a video lecture students replayed) to identify possible unclear or particularly interesting points that should be considered in course development.
The-Numbers.com uses its database to tell Hollywood producers the likely revenue from a film long before the first take is shot. The database holds about 30 million records covering every commercial US film of recent decades. The records contain information about budget, genre, cast, crew, awards, revenue (including US and international box office, foreign rights, video sales and rentals), and more.
The performance indicators of companies that have succeeded in making data-driven decisions are 6% higher than those of companies that do not rely on data when making decisions.
The five-digit numbers tattooed on the forearms of prisoners in Nazi concentration camps corresponded to the numbers on IBM Hollerith punched cards.
Some "smart" electricity meters in the US and Europe can collect between 750 and 3,000 data points per month in real time. Each appliance has a unique "load signature" when drawing power, which makes it possible to distinguish a refrigerator from a TV, and a TV from a grow light for marijuana. Thus, electricity usage reveals personal information, whether it's daily habits, medical conditions, or unlawful behavior.
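The idea of telling appliances apart by their load signature can be sketched in a few lines of Python. This is only a toy illustration: the signature values and the `classify` helper are invented for the example, and real meter analysis relies on far richer features than two numbers per appliance.

```python
# Hypothetical load signatures (invented values, not measurements):
# (average power draw in watts, typical on-cycle length in minutes).
SIGNATURES = {
    "refrigerator": (150.0, 20.0),
    "tv": (100.0, 180.0),
    "grow light": (600.0, 720.0),
}


def classify(power_w, cycle_min):
    """Return the appliance whose signature is closest to the observed event."""
    def distance(signature):
        ref_power, ref_cycle = signature
        # Relative differences, so that watts and minutes are comparable.
        return (abs(power_w - ref_power) / ref_power
                + abs(cycle_min - ref_cycle) / ref_cycle)

    return min(SIGNATURES, key=lambda name: distance(SIGNATURES[name]))


print(classify(580.0, 700.0))  # closest to the "grow light" signature
print(classify(140.0, 25.0))   # closest to the "refrigerator" signature
```

Even this crude nearest-match rule shows why a stream of meter readings can reveal what is running inside a home.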
In 2006, AOL made public for research purposes 20 million old search queries from 650,000 users over six months. The dataset was carefully anonymized. Within days, New York Times staff linked search queries and de-anonymized some users. AOL's CTO and two employees were fired.
Netflix announced the Netflix Prize competition, released 100 million rental records from 500,000 users, and announced a $1 million prize for those who could improve the movie recommendation system. Personal identifiers were removed. Again, users were exposed. The ratings of anonymized users matched the ratings of people with specific names on IMDb. With just six movie ratings, a client's identity could be correctly determined in 84% of cases. And knowing the date when a person rated movies, they could be identified with 99% accuracy within the dataset.
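The matching step behind this kind of exposure can be illustrated with a small Python sketch. It is a simplified toy, not the method the researchers actually used: all names and ratings are invented, and the `reidentify` helper with its `min_matches` threshold is a hypothetical stand-in that simply counts identical ratings.

```python
# Toy illustration of a linkage attack; all names and ratings are invented.

def reidentify(anon_ratings, public_profiles, min_matches=6):
    """Return the public name whose ratings agree with the anonymized record
    on at least `min_matches` (movie, stars) pairs, or None if nobody does."""
    best_name, best_matches = None, 0
    for name, ratings in public_profiles.items():
        matches = sum(
            1 for movie, stars in anon_ratings.items()
            if ratings.get(movie) == stars
        )
        if matches > best_matches:
            best_name, best_matches = name, matches
    return best_name if best_matches >= min_matches else None


# An "anonymized" rental record: the user ID is gone, the ratings remain.
anon_user = {"m1": 5, "m2": 3, "m3": 4, "m4": 1, "m5": 5, "m6": 2, "m7": 4}

# Ratings posted under real names on a public site.
public_profiles = {
    "alice": {"m1": 5, "m2": 3, "m3": 4, "m4": 1, "m5": 5, "m6": 2},
    "bob": {"m1": 4, "m2": 3, "m8": 5},
}

print(reidentify(anon_user, public_profiles))
```

Removing the identifier does not help here, because the pattern of ratings itself acts as a fingerprint once a second, named dataset is available for comparison.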
Sensors installed in most cars to track airbag activations are known to be able to "testify" against car owners in court in the event of a dispute over an accident. A modern car has about 60 sensors.
According to a Washington Post investigation in 2010, the NSA intercepts and stores 1.7 billion emails, phone calls, and other communications daily. According to estimates by William Binney, a former NSA employee, the government has collected 20 trillion transactions between US citizens and others: who called whom, emailed whom, sent money to whom, and so on.
Parole boards in thirty states use predictions based on data analysis as a factor in deciding whether to release a particular prisoner.
In the US, police in certain states use big data analysis to select streets, groups, and individuals for additional scrutiny, providing data on potential threat zones in terms of location (within a few blocks) and time (within a few hours of a specific day of the week). Police establish correlations between crime data and additional datasets, such as payday dates and event dates. For example, police hypothesized that gun shows are followed by an increase in crime. Analysis proved them right, but with one caveat — the crime spike occurred two weeks after the event, not immediately afterward.
New York City utility companies use big data analysis.
McNamara felt that he could understand what was happening on the ground only by staring at a spreadsheet: all those neat rows and columns, calculations and graphs, whose mastery seemed to bring him one standard deviation closer to God.