Author:Niemenmaa, Matti
Title:Analysing sequencing data in Hadoop: The road to interactivity via SQL
Publication type:Master's thesis
Publication year:2013
Pages:xv + 143
Language:eng
Department/School:School of Science (Perustieteiden korkeakoulu)
Main subject:Theoretical Computer Science (Tietojenkäsittelyteoria, T-79)
Supervisor:Heljanko, Keijo
Instructor:Heljanko, Keijo
Electronic version URL: http://urn.fi/URN:NBN:fi:aalto-201312198156
OEVS:Electronic archive copy is available via Aalto Thesis Database.
Location:P1 Ark Aalto | Archive
Keywords:hive
shark
impala
hadoop
mapreduce
HDFS
SQL
sequencing data
big data
interactive analysis
Abstract (eng): Analysis of high volumes of data has always been performed with distributed computing on computer clusters.
But due to rapidly increasing data amounts in, for example, DNA sequencing, new approaches to data analysis are needed.
Warehouse-scale computing environments with up to tens of thousands of networked nodes may be necessary to solve future Big Data problems related to sequencing data analysis.
And to utilize such systems effectively, specialized software is needed.

Hadoop is a collection of software built specifically for Big Data processing, with a core consisting of the Hadoop MapReduce scalable distributed computing platform and the Hadoop Distributed File System, HDFS.
This work explains the principles underlying Hadoop MapReduce and HDFS as well as certain prominent higher-level interfaces to them: Pig, Hive, and HBase.
An overview of the current state of Hadoop usage in bioinformatics is then provided alongside brief introductions to the Hadoop-BAM and SeqPig projects of the author and his colleagues.

Data analysis tasks are often performed interactively, exploring the data sets at hand in order to familiarize oneself with them in preparation for well-targeted long-running computations.
Hadoop MapReduce is optimized for throughput instead of latency, making it a poor fit for interactive use.
This Thesis presents two high-level alternatives designed especially with interactive data analysis in mind: Shark and Impala, both of which are Hive-compatible SQL-based systems.

Aside from the computational framework used, the format in which the data sets are stored can greatly affect analytical performance.
Thus new file formats are being developed to better cope with the needs of modern and future Big Data sets.
This work analyses the current state-of-the-art storage formats used in the worlds of bioinformatics and Hadoop.

Finally, this Thesis presents the results of experiments performed by the author with the goal of understanding how well the landscape of available frameworks and storage formats can tackle interactive sequencing data analysis tasks.
ED:2013-12-18
INSSI record number: 48233