Author:Ansari, Aftab
Title:Evaluation of cloud based approaches to data quality management
Publication type:Master's thesis
Publication year:2016
Pages:83
Language:eng
Department/School:Perustieteiden korkeakoulu (School of Science)
Main subject:Service Design and Engineering (IL3005)
Supervisor:Nurminen, Jukka
Instructor:Moloney, Seamus
Electronic version URL: http://urn.fi/URN:NBN:fi:aalto-201602161378
Location:P1 Ark Aalto  5683   | Archive
Keywords:data quality management
ETL
data cleaning
hive
hadoop
azure microsoft
Abstract (eng):Quality of data is critical for making data-driven business decisions.
Enhancing the quality of data enables companies to make better decisions and prevent business losses.
Systems similar to Extract, Transform, and Load (ETL) are often used to clean data and improve its quality.
Currently, businesses tend to collect massive amounts of customer data, store it in the cloud, and analyze it to draw statistical inferences about their products, services, and customers.
Cheaper storage and the constantly improving approaches to data privacy and security provided by cloud vendors, such as Microsoft Azure and Amazon Web Services, seem to be the key driving forces behind this trend.

This thesis implements an Azure Data Factory based ETL system that serves the purpose of data quality management on the Microsoft Azure cloud platform.
In addition to Azure Data Factory, there are four other key components in the system: (1) Azure Storage for storing raw and semi-cleaned data; (2) HDInsight for processing raw and semi-cleaned data using Hadoop clusters and Hive queries; (3) Azure ML Studio for processing raw and semi-cleaned data using R scripts and other machine learning algorithms; (4) Azure SQL Database for storing the cleaned data.
This thesis shows that using Azure Data Factory as the core component offers many benefits because it helps in scheduling jobs and monitoring the whole data transformation process.
Thus, it makes the data intake process more timely, guarantees data reliability, and simplifies data auditing.
The developed system was tested and validated using sample raw data.
ED:2016-02-21
INSSI record number: 53170