【Big-Data】框架和概念学习

 

Concept Collection

Category Concept Explanation Pros Cons
General Platform Spark A unified analytics engine for large-scale data processing.
Spark: Spark
Spark Explanation: Spark explanation
Spark is a fast and resilient large distributed data processing platform.
 
  Flink Flink is a framework and distributed processing engine for stateful computations over bounded and unbounded data.
Arch Intro: Flink - Arch
Use cases: Event-driven applications, Data analytics applications, data pipeline applications.
Read event-log in a real time manner. The arch stores and processes data locally and update persistant remote storage periodically. In the meantime, it streams event to downstream.  
  Hadoop Hadoop is a framework the stores process and analyze data which are very huge in column.
Intro: Hadoop
   
Storage HDFS HDFS is a distributed file system that stores very large files. 1. Stores very large files.
2. Streaming data access(write-once, read-many-times)
3. Cheap hardware
1. Low latency data access.
2. Lots of small files
3. Multiple writes
Resource Scheduling Yarn Yarn - Yet Another Resourse Negotiator.
Yarn provides a generic and flexible framework to administer the computing resources in the Hadoop cluster.
   
  Mesos Mesos is a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI.
Multiplexing a cluster between frameworks is the main use case of mesos.
White paper: Paper
It could perform fine-grained resource sharing across diverse clustering computing frameworks.  
Data Analysis Pig Pig is an SQL-like language that could be compiled to a series of Map-reduce operations that could be optimized to execute.    
  Hive The Apache Hive ™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.    
  Kylin Apache Kylin™ is an open source, distributed Analytical Data Warehouse for Big Data; it was designed to provide OLAP (Online Analytical Processing) capability in the big data era. Could be used in OLAP, near realtime.  
  Spark SQL Spark SQL could query structured data inside Spark programs, using eigher SQL or other programming languages such as Python, Java. etc.
More: Spark
   
  Spark Dataframe A Dataframe is a dataset organised into named columns. It is conceptually equivalent to a table in a relational database or a data frame in python, but with a richer optimization under the hood.
Reference: Spark Dataframe
   
  Impala Impala is an Apache native arch that circumvents MapReduce operations to directly access the data through a specialized distributed data query engine.
Reference: Impala
   
  Elastic Stack