What Is Big Data? What Issues Are Companies Facing, and How Can They Be Resolved Using Big Data?

Abhishek Chouhan
14 min read · Feb 24, 2021

Hey everyone!

In this article, you will learn what Hadoop, Big Data, and distributed storage are, how top MNCs are using Big Data, what data-related issues companies are facing, and how those issues can be resolved using Big Data.

Data are characteristics or information, usually numerical, that are collected through observation. In a more technical sense, data are a set of values of qualitative or quantitative variables about one or more persons or objects, while a datum is a single value of a single variable. In everyday terms, data comes in many forms, such as files, audio, videos, and messages.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
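To make the map, shuffle, and reduce steps concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample blocks are made up for illustration, and a real cluster would run the map step on the nodes that hold each block.

```python
from collections import defaultdict
from itertools import chain


def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(blocks):
    # Each block stands in for an HDFS block processed by a separate node.
    mapped = chain.from_iterable(map_phase(block) for block in blocks)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two "blocks" of one file.
blocks = ["big data is big", "data is distributed"]
counts = word_count(blocks)
print(counts)
```

Because each call to `map_phase` only looks at its own block, the map step can run in parallel on different machines, which is the data-locality idea described above.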

The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common — contains libraries and utilities needed by other Hadoop modules;
  • Hadoop Distributed File System (HDFS) — a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;
  • Hadoop YARN — (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users’ applications;
  • Hadoop MapReduce — an implementation of the MapReduce programming model for large-scale data processing.

A distributed storage system is infrastructure that can split data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. A distributed object store is made up of many individual object stores, normally consisting of one or a small number of physical disks. These object stores run on commodity server hardware, which might be the compute nodes or might be separate servers configured solely for providing storage services. As such, the hardware is relatively inexpensive. The disk of each virtual machine is broken up into a large number of small segments, typically a few megabytes in size each, and each segment is stored several times (often three) on different object stores. Each copy of each segment is called a replica. The system is designed to tolerate failure. As relatively inexpensive hardware is used, failure of individual object stores is comparatively frequent; indeed, with enough object stores, failure becomes inevitable. However, as it would require every replica to become unavailable for data to be lost, failure of individual object stores is not an ‘emergency event’ requiring call-out of storage engineers, but something handled through routine maintenance. Performance does not noticeably degrade, and the under-replicated data is gradually and automatically re-replicated from existing replicas. There is no ‘re-silvering’ operation to perform when the defective object store is replaced in the same way that would happen with a replacement RAID disk.
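The replica behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular storage product works: the store names, the random placement policy, and the helper names (`place_segments`, `fail_store`, `re_replicate`) are all hypothetical. The replica count of three follows the "often three" in the text.

```python
import random

REPLICAS = 3  # "often three" copies of each segment, as in the text


def place_segments(num_segments, stores, replicas=REPLICAS, seed=0):
    """Assign each segment to `replicas` distinct object stores."""
    rng = random.Random(seed)
    return {seg: set(rng.sample(stores, replicas)) for seg in range(num_segments)}


def fail_store(placement, failed):
    """Simulate one object store failing: its replicas disappear."""
    for holders in placement.values():
        holders.discard(failed)


def re_replicate(placement, healthy_stores, replicas=REPLICAS, seed=1):
    """Copy under-replicated segments from surviving replicas to other stores."""
    rng = random.Random(seed)
    for seg, holders in placement.items():
        if not holders:
            # Data is lost only if every replica became unavailable.
            raise RuntimeError(f"segment {seg} has no surviving replica")
        candidates = [s for s in healthy_stores if s not in holders]
        holders.update(rng.sample(candidates, replicas - len(holders)))


# Hypothetical cluster of five object stores holding eight segments.
stores = ["store-a", "store-b", "store-c", "store-d", "store-e"]
placement = place_segments(8, stores)

fail_store(placement, "store-b")
healthy = [s for s in stores if s != "store-b"]
re_replicate(placement, healthy)
```

After the simulated failure every segment is back to three replicas on the surviving stores, without any segment having been unavailable in the meantime.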

Distributed storage systems can store several types of data:

  • Files — a distributed file system allows devices to mount a virtual drive, with the actual files distributed across several machines.
  • Block storage — a block storage system stores data in volumes known as blocks. This is an alternative to a file-based structure that provides higher performance. A common distributed block storage system is a Storage Area Network (SAN).
  • Objects — a distributed object storage system wraps data into objects, identified by a unique ID or hash.
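As a rough illustration of the "identified by a unique ID or hash" idea, the sketch below stores each object under the SHA-256 hash of its contents. The `ObjectStore` class is hypothetical and keeps everything in one in-memory dictionary; a real distributed object store would spread the objects across many machines.

```python
import hashlib


class ObjectStore:
    """Toy in-memory object store keyed by the SHA-256 hash of the content."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        # The object ID is derived from the content itself.
        object_id = hashlib.sha256(data).hexdigest()
        self._objects[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]


store = ObjectStore()
oid = store.put(b"hello big data")
print(oid, store.get(oid))
```

One consequence of hashing the content is that storing the same bytes twice yields the same ID, so identical objects are naturally deduplicated.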

Distributed storage systems have several advantages:

  • Scalability — the primary motivation for distributing storage is to scale horizontally, adding more storage space by adding more storage nodes to the cluster.
  • Redundancy — distributed storage systems can store more than one copy of the same data, for high availability, backup, and disaster recovery purposes.
  • Cost — distributed storage makes it possible to use cheaper, commodity hardware to store large volumes of data at low cost.
  • Performance — distributed storage can offer better performance than a single server in some scenarios, for example, it can store data closer to its consumers, or enable massively parallel access to large files.

Distributed Storage Features and Limitations

Most distributed storage systems have some or all of the following features:

  • Partitioning — the ability to distribute data between cluster nodes and enable clients to seamlessly retrieve the data from multiple nodes.
  • Replication — the ability to replicate the same data item across multiple cluster nodes and maintain consistency of the data as clients update it.
  • Fault tolerance — the ability to retain availability to data even when one or more nodes in the distributed storage cluster goes down.
  • Elastic scalability — enabling data users to receive more storage space if needed, and enabling storage system operators to scale the storage system up and down by adding or removing storage units to the cluster.
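Partitioning, the first feature in the list, is often done by hashing a key and using the hash to pick a node. The sketch below shows the simplest variant (hash modulo the number of nodes). The node names, keys, and function names are hypothetical, and real systems frequently use more elaborate schemes such as consistent hashing, because plain modulo hashing moves most keys when the node count changes.

```python
import hashlib


def node_for_key(key, nodes):
    """Pick a node for a key using a stable hash (same key, same node)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


def partition(keys, nodes):
    """Return a mapping of node name to the list of keys it is responsible for."""
    layout = {node: [] for node in nodes}
    for key in keys:
        layout[node_for_key(key, nodes)].append(key)
    return layout


# Hypothetical three-node cluster and a handful of file names.
nodes = ["node-1", "node-2", "node-3"]
keys = [f"file-{i}.log" for i in range(12)]
layout = partition(keys, nodes)
print(layout)
```

A client that knows the node list can compute `node_for_key` itself, which is what lets it retrieve data from the right node without asking a central directory.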

Big Data

Big data refers to the large, diverse sets of information that grow at ever-increasing rates. It encompasses the volume of information, the velocity or speed at which it is created and collected, and the variety or scope of the data points being covered. Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. In practice, Big Data refers to the huge amounts of data that big companies like Google and Facebook receive: hundreds of terabytes (TB) every day. They are able to manage this much data because they use Big Data technologies.

Types of Big Data

Let’s discuss the types.

Structured

The term structured data generally refers to data that has a defined length and format for big data. Examples of structured data include numbers, dates, and groups of words and numbers called strings. … Structured data is the data you’re probably used to dealing with. It’s usually stored in a database.

Unstructured

Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Semi-structured

Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
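The difference between the three types is easier to see with a small example of each. In the sketch below, all sample values are invented for illustration: a CSV snippet stands in for structured data, a JSON document for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json
import re

# Structured: fixed columns, every row has the same fields.
structured = "id,name,signup_date\n1,Asha,2021-02-24\n2,Ravi,2021-02-25\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Semi-structured: no fixed table schema, but keys and nesting mark the fields.
semi_structured = '{"id": 3, "name": "Meera", "tags": ["hadoop", "hdfs"]}'
record = json.loads(semi_structured)

# Unstructured: free text; any facts have to be extracted, here with a regex.
unstructured = "Met Ravi on 2021-02-25 to talk about the Hadoop cluster."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", unstructured)

print(rows[0]["name"], record["tags"], dates)
```

The structured and semi-structured samples can be read with an off-the-shelf parser, while the unstructured sentence needs pattern matching (or heavier text analysis) before it yields any fields.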

Characteristics of Big Data

There are many characteristics of big data.

Variety

Variety in Big Data refers to all the structured and unstructured data that has the possibility of getting generated either by humans or by machines. … However, unstructured data like emails, voicemails, hand-written text, ECG reading, audio recordings etc, are also important elements under Variety.

Velocity

By velocity, we refer to the speed with which data are being generated. Staying with our social media example, every day 900 million photos are uploaded on Facebook, 500 million tweets are posted on Twitter, 0.4 million hours of video are uploaded on YouTube and 3.5 billion searches are performed on Google.

Volume

Big data is about volume, and volumes of data can reach unprecedented heights. It’s estimated that 2.5 quintillion bytes of data are created each day, and one projection put the total at 40 zettabytes of data created by 2020, an increase of 300 times from 2005.
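Numbers like "quintillion" and "zettabyte" are hard to picture, so here is a short unit-conversion sketch using only the figures quoted above. It assumes decimal (SI) units, where one exabyte is 10^18 bytes and one zettabyte is 10^21 bytes; the variable names are just for illustration.

```python
EXABYTE = 10**18    # bytes, decimal (SI) definition
ZETTABYTE = 10**21  # bytes, decimal (SI) definition

# 2.5 quintillion bytes per day, written as an exact integer.
bytes_per_day = 25 * 10**17

exabytes_per_day = bytes_per_day / EXABYTE
zettabytes_per_year = bytes_per_day * 365 / ZETTABYTE

# "40 zettabytes ... an increase of 300 times from 2005" implies this 2005 baseline.
implied_2005_zettabytes = 40 / 300

print(f"{exabytes_per_day} EB per day")
print(f"{zettabytes_per_year} ZB per year")
print(f"{implied_2005_zettabytes:.3f} ZB implied for 2005")
```

So 2.5 quintillion bytes a day is 2.5 exabytes a day, or a little under one zettabyte a year at that rate.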

Why is Big Data Important?

The ability to consistently get business value from data is now a trait of successful organizations across every industry, and of every size. In some industries (such as Retail, Advertising, and Financial Services, with more constantly joining the list), it’s even a matter of survival.

Data analytics only returns more value when you have access to more data, so organizations across multiple industries have found big data to be a rich resource for uncovering profound business insights. And, because machine-learning models get more efficient as they are “trained” with more data, machine learning and big data are highly complementary.

How will I know if my data is “big”?

Although many enterprises have yet to reach petabyte scale with respect to data volumes, it is possible that their data has one of the other two defining characteristics of big data. And, if there is any single guarantee, it’s that your data will grow over time — probably, exponentially. In that sense, all “big data” starts as “small data.”

How much data Facebook gets every day

Facebook revealed some big stats on big data to a few reporters at its HQ, including that its system processes 2.5 billion pieces of content and 500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data each half hour.
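To get a feel for these figures, the daily and half-hourly totals can be turned into per-second rates. This is simple arithmetic on the numbers quoted above, assuming decimal units (1 TB = 10^12 bytes) and treating "500+" as exactly 500.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TB = 10**12  # bytes, decimal definition
GB = 10**9   # bytes, decimal definition

# 500 TB ingested per day, expressed as GB per second.
ingest_gb_per_s = 500 * TB / SECONDS_PER_DAY / GB

# 105 TB scanned every half hour, expressed as GB per second.
scan_gb_per_s = 105 * TB / (30 * 60) / GB

# 48 half-hour windows in a day.
scanned_tb_per_day = 105 * 48

print(f"ingest: {ingest_gb_per_s:.2f} GB/s")
print(f"scan:   {scan_gb_per_s:.1f} GB/s")
print(f"scanned per day: {scanned_tb_per_day} TB")
```

On these numbers the system scans roughly ten times as much data per day (about 5,040 TB) as it ingests (about 500 TB).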

Star Performers Working Behind Facebook’s Big Data

There is a combined workforce of people and technology constantly working behind the successful implementation of this platform. Though the platform is continuously being enriched, below are the prime technological aspects:

Big Data tools and what they do for us

Hadoop

Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation.

Scuba

Scuba is a distributed in-memory database built at Facebook. It is a time-series data analysis database aimed at answering real-time analytical queries approximately. Scuba aims to keep data ingestion latency low and handles huge data inflow by expelling old data from memory.

Cassandra

Apache Cassandra is a free and open-source, distributed, wide column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.

Prism

PRISM is an unstructured big data aggregation framework (covering audio and video chats, phone calls, photographs, e-mails, documents, internet searches, Facebook posts, mobile logs and connection logs) together with relevant analytics that enable analysts to extract patterns. There are some more tools as well.

Google: How Big Data is at the Heart of Google’s Business Model

Google are probably responsible for introducing people to the benefits of analysing and interpreting Big Data in their day‐to‐day lives. This section explains how Big Data is at the heart of Google’s business model. Google uses the data from its Web index to initially match queries with potentially useful results. This is augmented with data from trusted sources and other sites that have been ranked for accuracy by machine‐learning algorithms designed to assess the reliability of data. Google monetized their search engine by working out how to capture the data it collects from us as we browse the Web, building up vast revenues by becoming the biggest sellers of online advertising in the world. Then they used the huge resources they were building up to rapidly expand, identifying growth areas such as mobile and Internet of Things in which to also apply their data‐driven business model.

Google: 40,000 Google Web Searches Per Second

More than 3.7 billion humans now have regular access to and use the internet. That results in about 40,000 web searches per second on Google alone. Furthermore, over half of all those web searches take place on mobile devices. It is likely the web search totals will continue to grow as more and more people get their hands on mobile devices across the world.
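As a quick sanity check, the per-second figure can be scaled up to a full day and compared with the 3.5 billion daily Google searches quoted earlier in this article. The snippet only uses those two numbers.

```python
SECONDS_PER_DAY = 24 * 60 * 60

searches_per_second = 40_000
searches_per_day = searches_per_second * SECONDS_PER_DAY

print(f"{searches_per_day:,} searches per day")
print(f"{searches_per_day / 1e9:.2f} billion per day")
```

At 40,000 searches per second, the daily total comes to about 3.46 billion, which lines up with the 3.5 billion figure.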

Amazon: Using Big Data to understand customers

Amazon has thrived by adopting an “everything under one roof” model. However, when faced with such a huge range of options, customers can often feel overwhelmed. They effectively become data-rich, with tons of options, but insight-poor, with little idea about what would be the best purchasing decision for them.

To combat this, Amazon uses Big Data gathered from customers while they browse to build and fine-tune its recommendation engine. The more Amazon knows about you, the better it can predict what you want to buy. And, once the retailer knows what you might want, it can streamline the process of persuading you to buy it — for example, by recommending various products instead of making you search through the whole catalogue.

Amazon’s recommendation technology is based on collaborative filtering, which means it decides what it thinks you want by building up a picture of who you are, then offering you products that people with similar profiles have purchased.
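The idea of "offering you products that people with similar profiles have purchased" can be sketched as a tiny user-based collaborative filter. This is a toy illustration and not Amazon's actual system: the shoppers, the products, the choice of Jaccard similarity, and the function names are all assumptions made for the example.

```python
def jaccard(a, b):
    """Similarity of two purchase sets: shared items divided by all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, top_n=3):
    """Score items the target has not bought by the similarity of their buyers."""
    own = purchases[target]
    scores = {}
    for user, items in purchases.items():
        if user == target:
            continue
        similarity = jaccard(own, items)
        if similarity == 0:
            continue  # shoppers with nothing in common do not influence the result
        for item in items - own:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical purchase histories.
purchases = {
    "alice": {"kettle", "teapot", "mug"},
    "bob": {"kettle", "teapot", "tea"},
    "carol": {"mug", "coaster"},
    "dave": {"laptop", "mouse"},
}
recs = recommend("alice", purchases)
print(recs)
```

Here "bob" shares two of alice's purchases and "carol" shares one, so bob's extra item ranks above carol's, while "dave" has nothing in common with alice and contributes nothing.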

The technical details

Amazon collects data from users as they navigate the site, such as the time spent browsing each page. The retailer also makes use of external datasets, such as census data for gathering demographic details. Amazon’s core business is handled in its central data warehouse, which consists of Hewlett-Packard servers running Oracle on Linux.

How Apple is Using Big Data

Apple is often on the cutting edge of technological advances, so it probably shouldn’t be a surprise that the company uses big data extensively. Having said that, it’s important to note that it wasn’t always this way. Other businesses like Google were heavily involved in big data years before Apple took the leap, but Apple has worked tirelessly to catch up to the competition. Now, the company has become enmeshed in big data analytics, with the technology driving many of their most important decisions. It’s true that Apple remains highly secretive about how they use big data in many cases, but that hasn’t prevented some interesting insights from being divulged. By learning how Apple is using big data analytics, other companies can get a better view of how best to utilize the incredibly versatile technology.

Of course, Apple is in an advantageous position when compared to many other businesses. Their products are not only highly popular, but they’re already designed to capture valuable data. That’s how the company is able to gather information on customers relatively easily. It’s all that data that helps Apple determine how best to approach new products and services. One area in particular that has received a boost from big data analytics is application design. Applications are the useful tools many people have on their smartphones and tablets, and those tools can collect data on exactly how people use them. This is an important distinction to make, since in the past, designs were made intending to force people to use applications a certain way. Now, Apple can discover how people are using apps in real life and alter future designs to fit with customer tendencies.

How YouTube is Using Big Data

Big Data also offers a breakdown of viewers’ viewing habits such as how long people like to view ads, how long they watch shows for, what kind of content they skip and so on. … YouTube is a good example of how this technology is leveraged to understand the audience and produce content that appeals to them.

YouTube’s popularity is no secret as the website is worth over $26 billion. It is the website where everyone goes to check out the latest videos. In fact, all the world’s top brands have their own YouTube channel. To lessen the amount of time each user searches for the videos they prefer, YouTube uses big data. Here are a few ways they do it:

Recommended Channels

YouTube bases its recommended channels on the videos you watch most often. For example, if you often watch videos about basketball, you can expect the recommended channels to be all about basketball tutorials, NBA highlights and player interviews.

How Netflix uses big data to create content and enhance user experience

An estimated 80% of content streamed on Netflix is influenced by its recommendation system, and shows like House of Cards are often cited as examples of how the company keeps users engaged.

With a 51 percent market share of the American streaming industry and over 148 million streaming subscribers worldwide as of Q4 2018, Netflix is certainly a force to be reckoned with.

More interestingly, Netflix is on track to be profitable. The chart below, courtesy of Statista, shows Netflix’s annual revenue from 2002 to 2018, and one thing is clear: Netflix is growing consistently and exponentially.

How Big Data is Making Instagram Stories More Effective

Instagram marketers can’t ignore the benefits of big data. Instagram CEO Kevin Systrom recently announced that Instagram is becoming a leading big data company.

Instagram isn’t the only company focused on the benefits of big data. The marketers that use the platform need to consider the implications of AI, machine learning and other data technology as well. Here are some of the benefits of big data in Instagram marketing:

  • Machine learning helps you gauge your audience’s response to different Instagram stories
  • AI tools use big data to create better content
  • Predictive analytics tools use machine learning to get a better understanding of the ROI of various time slots (this is something that many Instagram analytics tools consider).

General Stats: Per-Minute Rates

Here are some per-minute figures for various online platforms:

  • Snapchat: Over 527,760 photos shared by users
  • Amazon: $258,751.90 in Sales Per Minute
  • LinkedIn: Over 120 professionals join the network
  • YouTube: 4,146,600 videos watched
  • Twitter: 456,000 tweets sent or created
  • Instagram: 46,740 photos uploaded
  • Netflix: 69,444 hours of video watched
  • Giphy: 694,444 GIFs served
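Per-minute figures become easier to compare once they are scaled to a full day. The sketch below does that for three of the entries in the list; the dictionary keys are just labels chosen for this example.

```python
MINUTES_PER_DAY = 24 * 60

# Per-minute figures taken from the list above.
per_minute = {
    "Netflix hours watched": 69_444,
    "Giphy GIFs served": 694_444,
    "Twitter tweets": 456_000,
}

per_day = {name: rate * MINUTES_PER_DAY for name, rate in per_minute.items()}

for name, total in per_day.items():
    print(f"{name}: {total:,} per day")
```

The Netflix and Giphy figures work out to almost exactly 100 million hours and 1 billion GIFs per day, which suggests they were derived from round daily totals. The Twitter figure gives about 657 million tweets per day, higher than the 500 million quoted earlier, so the two numbers probably come from different sources or years.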

Advantages of Big Data

Benefits of Using Big Data Analytics

  • Identifying the root causes of failures and issues in real time.
  • Fully understanding the potential of data-driven marketing.
  • Generating customer offers based on their buying habits.
  • Improving customer engagement and increasing customer loyalty.
  • Reevaluating risk portfolios quickly.

Companies using Big Data

Here we look at some of the businesses integrating big data and how they are using it to boost their brand success.

  • Amazon. …
  • American Express. …
  • BDO. …
  • Capital One. …
  • General Electric (GE) …
  • Miniclip. …
  • Netflix. …

Big Data Problems You Need to Solve

  • Lack of Understanding. Companies can leverage data to boost performance in many areas. …
  • High Cost of Data Solutions. …
  • Too Many Choices. …
  • Complex Systems for Managing Data. …
  • Security Gaps. …
  • Low Quality and Inaccurate Data. …

A Few Last Words

→ Many challenges can be solved with big data; it is a big thing in itself.

→ I will be sharing more blogs on big data, Hadoop, and related topics in the upcoming days.

→ Follow me on Medium for more blogs on integrating new tools and technologies.

For further queries or suggestions, feel free to connect with me on LinkedIn.

If you like it, then clap and share.

Thank you, everyone, for reading!
