Author:Xue, Rui
Title:SQL Engines for Big Data Analytics
SQL hakukone isoa datan analyysia varten
Publication type:Master's thesis
Publication year:2015
Pages:56+7
Language:eng
Department/School:School of Science
Main subject:Software Technology (T3001)
Supervisor:Heljanko, Keijo
Instructor:Heljanko, Keijo
Electronic version URL: http://urn.fi/URN:NBN:fi:aalto-201512165719
Location:P1 Ark Aalto  3262   | Archive
Keywords:hadoop
SQL
interactive analysis
hive
spark
spark SQL
Abstract (eng):Traditional relational database systems cannot accommodate the need to analyze data of large volume and varied formats, i.e., Big Data.
Apache Hadoop, as the first generation of open-source Big Data solutions, provided a stable distributed data storage and resource management system.
However, because Hadoop is a MapReduce framework, the only way to use its parallel computing power is through the API.
Given a problem, one has to code a corresponding MapReduce program in Java, which is time-consuming.
Moreover, Hadoop focuses on high throughput rather than low latency.
Therefore, Hadoop can be a poor fit for interactive data processing.
For instance, more and more DNA genomic sequence data has been generated recently, and processing the genomic sequences in a single standalone system is next to impossible.
But genomic researchers usually specialize in their own field rather than in programming, and they do not expect a long wait before they get the data they are interested in.

The demand for interactive Big Data processing necessitated decoupling data storage from analysis.
The simple SQL queries of traditional relational database systems are still the most practical analysis tool, one that people without a programming background can also benefit from.
As a result, Big Data SQL engines have been spun off in the Hadoop Ecosystem.

This thesis first discusses the variety of Big Data storage formats and introduces Hadoop as necessary background knowledge.
Chapter three then introduces three Hadoop-based SQL engines, i.e., Hive, Spark, and Impala, and focuses on the first two, currently the most popular ones.
To gain a deeper understanding of these SQL engines, an SQL benchmark experiment on Hive and Spark was executed with BAM data, a binary genomic data format, as input, and is presented in this thesis.
Finally, conclusions about Hadoop-based SQL engines are given.
ED:2016-01-17
INSSI record number: 52842