An Overview of Big Data

What Is Big Data

  • A collection of data sets so large and complex that they become difficult to capture, store, manage and analyze using typical database management tools or traditional data processing applications.
  • The term also refers to data sets that grow so large and complex, so quickly, that they become difficult to capture, store, manage, analyze and visualize with current computational architectures.

Why Big Data

  • Existing tools are becoming inadequate for processing large-scale data sets
  • We may need to process data on the order of gigabytes to terabytes or petabytes
  • Storage capacity is increasing
  • Processing power is increasing
  • Computation power is increasing
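
To give a feel for the data sizes mentioned above, here is a rough back-of-envelope sketch of how long a single sequential read of a data set might take. The throughput of 100 MB per second per machine is an assumed placeholder, not a measurement, and `scan_seconds` is a hypothetical helper introduced only for this illustration.

```python
# Back-of-envelope estimate: time to read a data set once, sequentially.
# The throughput figure below is an assumption, not a benchmark result.

GB = 10**9   # bytes in a gigabyte (decimal units)
TB = 10**12  # bytes in a terabyte
PB = 10**15  # bytes in a petabyte

ASSUMED_THROUGHPUT = 100 * 10**6  # bytes per second per machine (hypothetical)


def scan_seconds(size_bytes, machines=1, throughput=ASSUMED_THROUGHPUT):
    """Seconds needed to read size_bytes once, split evenly across machines."""
    return size_bytes / (throughput * machines)


if __name__ == "__main__":
    for label, size in [("1 GB", GB), ("1 TB", TB), ("1 PB", PB)]:
        single = scan_seconds(size)
        cluster = scan_seconds(size, machines=1000)
        print(f"{label}: {single:,.0f} s on one machine, "
              f"{cluster:,.2f} s on 1000 machines")
```

Under these assumptions a petabyte takes about 10,000,000 seconds (roughly 116 days) to read on one machine, and about 10,000 seconds when the work is split evenly across 1000 machines. Real systems rarely split work this evenly, so treat the numbers as an order-of-magnitude guide only.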

Big Data is Everywhere

Large amounts of data are collected and warehoused in the following areas:

  • Science experiments or inventions
  • Web Data, e-Commerce etc.
  • Purchases at various stores
  • Bank transactions
  • Social Networks

Characteristics of Big Data

  • Volume
    • refers to the gross amount of data that has to be analyzed
  • Velocity
    • the rate at which data is captured
  • Variety
    • different ways data are structured
    • data can be in various formats such as text, image, video, audio etc.
  • Variability
    • data can be inconsistent at times, which hampers handling and managing it effectively
  • Complexity
    • data management becomes very complex when data are coming from multiple sources
    • data need to be linked, connected and correlated in order to grasp the information

Big Data Analytics

  • A process of inspecting, cleaning, transforming, analyzing and modeling Big Data with the goal of discovering useful information, suggesting conclusions and supporting decision making
  • Analytics is concerned with the whole methodology rather than with any individual analysis step
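
The steps named above (inspecting, cleaning, transforming and analyzing) can be sketched as one small pipeline. This is a minimal illustration using only the Python standard library on a tiny in-memory list; the record fields (`store`, `amount`) and all function names are hypothetical, and the modeling step is left out for brevity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw purchase records; field names are made up for illustration.
RAW = [
    {"store": "north", "amount": "12.50"},
    {"store": "North ", "amount": "7.50"},
    {"store": "south", "amount": None},    # incomplete record
    {"store": "south", "amount": "20.00"},
    {"store": "south", "amount": "abc"},   # noisy record
]


def inspect(records):
    """Inspecting: count records whose amount is missing."""
    return sum(1 for r in records if r["amount"] is None)


def clean(records):
    """Cleaning: drop records whose amount is missing or not numeric."""
    kept = []
    for r in records:
        try:
            float(r["amount"])
        except (TypeError, ValueError):
            continue
        kept.append(r)
    return kept


def transform(records):
    """Transforming: normalize store names and convert amounts to floats."""
    return [
        {"store": r["store"].strip().lower(), "amount": float(r["amount"])}
        for r in records
    ]


def analyze(records):
    """Analyzing: average purchase amount per store."""
    by_store = defaultdict(list)
    for r in records:
        by_store[r["store"]].append(r["amount"])
    return {store: mean(amounts) for store, amounts in by_store.items()}


def run_pipeline(records):
    """Run the steps in order, as one methodology rather than isolated steps."""
    return analyze(transform(clean(records)))


if __name__ == "__main__":
    print("records with missing amount:", inspect(RAW))
    print("average amount per store:", run_pipeline(RAW))
```

Each function is simple on its own; the useful result comes from chaining them, which is the point made about analytics being a whole methodology.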

Challenges and Objectives

  • Challenges include capturing, storing, curating, transferring, searching, sharing, analyzing and visualizing
  • Big Data is not just about size
    • It is about finding insights in complex, noisy, heterogeneous, longitudinal and voluminous data
    • Its objective is to answer previously unanswered questions

Big Data Issues

  • Data issues
    • High dimensionality
    • Structure
    • Noise
    • Incomplete Data
    • Concept drift
    • Domain adaptation
    • Background knowledge incorporation
  • Platform issues
    • Parallel computing
    • Distributed computing
  • User interaction
    • Crowdsourcing

Applications of Big Data

  • Social networks and social media
  • Speech recognition and identification
  • Financial engineering
  • Medical and healthcare informatics
  • Science and research

Tools Used in Big Data

  • NoSQL
    • MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
  • MapReduce
    • Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • Storage
    • S3, Hadoop Distributed File System
  • Servers
    • EC2, Google App Engine, Elastic Beanstalk, Heroku
  • Processing
    • R, Yahoo Pipes, Mechanical Turk, Solr/Lucene, Elasticsearch, Datameer, BigSheets, TinkerPop
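
Several of the tools listed under MapReduce are built around the same programming model: a map phase emits key/value pairs, a shuffle phase groups the values by key, and a reduce phase combines each group. The sketch below imitates that model in plain, single-process Python with a word count. It is not the Hadoop API, and the function names are hypothetical; a real framework would run the map and reduce phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return reduce_phase(shuffle_phase(mapped))


if __name__ == "__main__":
    docs = ["big data is everywhere", "Big Data tools"]
    print(word_count(docs))
```

Because each document is mapped independently and each key is reduced independently, both phases can be distributed, which is what makes the model attractive for large data sets.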

I am a professional Web developer, Enterprise Application developer, Software Engineer and Blogger.
