A guide to deploying and administering Hadoop Clusters like a smart administrator

Image by Free-Photos from Pixabay

Several years ago, while I was still a developer, I was chosen to be part of the prestigious Data Administration group. It was a dream come true because not many got that chance. While I was preparing myself to join the elite, the Principal DBA gave me some very sound advice.

To be truthful, the advice did not make too much sense to me at that point…maybe I…


A journey into the evolution of Big Data Compute Platforms like Hadoop and Spark. Sharing my perspective on where we were, where we are and where we are headed.

Image by Gerd Altmann from Pixabay

Over the past few years I have been part of a large number of Hadoop projects. Back in 2012–2016, the majority of our work was done on on-premises Hadoop infrastructure.

On a typical project we would take care of every aspect of the Big Data pipeline, including Hadoop node procurement, deployment, pipeline development and administration. Back in those days Hadoop was not as mature as it is now, so in some cases we had to jump through hoops in order to get things done. Lack of proper documentation and expertise made things even more difficult.


A guide to implementing effective Kafka Cluster Design Strategies using Partitioning and Replication

Image by Tumisu from Pixabay

At its most basic level, a Kafka Cluster may be created with a single broker instance. Using a Kafka Producer, a data stream can be sent in the form of messages to the Kafka Broker. These messages stay on the Kafka Broker for a configurable period of time until a Kafka Consumer can retrieve and process them.
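To make this flow concrete, here is a minimal sketch using the kafka-python library, assuming a single broker listening on localhost:9092 and a hypothetical topic named events:

from kafka import KafkaProducer, KafkaConsumer
# Producer: send a small data stream as messages to the broker (broker address is an assumption)
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(5):
    producer.send("events", value=f"message-{i}".encode("utf-8"))
producer.flush()
# Consumer: retrieve and process messages while they are still within the retention period
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # start reading from the beginning of the topic
    consumer_timeout_ms=5000,      # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.partition, message.offset, message.value.decode("utf-8"))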


Learn how to track real-time gold prices using Apache Kafka and Pandas. Plot the latest prices on a Bar Chart.

Photo by Chris Liverani on Unsplash

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

We are in an era where tracking, processing and analyzing real-time data is becoming a necessity for many businesses. Needless to say, handling streaming data sets is becoming one of the most crucial and sought-after skills for Data Engineers and Scientists.
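As a rough sketch of where this article is headed, a consumer could pull incoming price messages into a Pandas DataFrame and plot the latest ones (the topic name gold-prices and the JSON fields time and price are assumptions):

import json
import pandas as pd
import matplotlib.pyplot as plt
from kafka import KafkaConsumer
# Consume gold-price messages published as JSON (topic name and schema are assumptions)
consumer = KafkaConsumer(
    "gold-prices",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
records = [msg.value for msg in consumer]  # e.g. {"time": "10:01", "price": 1890.5}
df = pd.DataFrame(records)
# Plot the latest prices on a bar chart
df.tail(10).plot(kind="bar", x="time", y="price", legend=False)
plt.ylabel("Gold price (USD/oz)")
plt.tight_layout()
plt.show()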

For this article I am assuming that you are familiar with Apache…


Easily process data changes over time from your database to a Data Lake using Apache Hudi on Amazon EMR

Image by Gino Crescoli from Pixabay

In a previous article, linked below, we discussed how to seamlessly collect CDC data using Amazon Database Migration Service (DMS).

https://towardsdatascience.com/data-lake-change-data-capture-cdc-using-amazon-database-migration-service-part-1-capture-b43c3422aad4

The following article will demonstrate how to process CDC data such that a near real-time representation of your database is achieved in your data lake. We will use the combined power of Apache Hudi and Amazon EMR to perform this operation. Apache Hudi is an open-source data management framework used to simplify incremental data processing in near real time.
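As a rough sketch of how Hudi applies incremental changes, an upsert of a CDC batch from Spark might look like the following (the table name, key columns and S3 paths are assumptions):

from pyspark.sql import SparkSession
# Requires the Hudi Spark bundle on the classpath (bundled with recent EMR releases)
spark = SparkSession.builder.appName("hudi-cdc-upsert").getOrCreate()
# A batch of CDC records produced by DMS (path and schema are assumptions)
cdc_df = spark.read.parquet("s3://my-dms-bucket/cdc/orders/")
hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",     # primary key column
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest change wins
    "hoodie.datasource.write.operation": "upsert",             # insert new rows, update existing ones
}
# Apply the changes incrementally to the Hudi table in the data lake
(cdc_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-data-lake/hudi/orders/"))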

We will kick-start the process by creating a new EMR Cluster:

$ aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications…
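The full CLI command is truncated above; as a rough Python equivalent, a small Hudi-capable cluster could be requested through boto3 along these lines (release label, instance types and counts are assumptions):

import boto3
emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption
response = emr.run_job_flow(
    Name="hudi-cdc-cluster",    # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",  # recent releases bundle Apache Hudi alongside Spark and Hive
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    AutoScalingRole="EMR_AutoScaling_DefaultRole",
)
print(response["JobFlowId"])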


Easily capture data changes over time from your database to a Data Lake using Amazon Database Migration Service (DMS)

Image by Gino Crescoli from Pixabay

Over the past 10 years spent in the Big Data and Analytics world, I have come to realize that capturing and processing change data sets has been a challenging area. Through all these years I have seen how CDC has evolved. Let me take you through the journey:

Year 2011–2013 — For many, Hadoop was the major Data Analytics Platform. Typically, Sqoop was used to transfer data from a given database to HDFS. This worked pretty well for full table loads. Sqoop's incremental mode could capture inserts as well.

But CDC is not only about inserts. …


How we were able to auto-scale an Optical Character Recognition Pipeline to convert thousands of PDF documents into text per day using an event-driven microservices architecture powered by Docker and Kubernetes

Image by mohamed Hassan from Pixabay

On a recent project we were called in to create a pipeline capable of converting PDF documents to text. The incoming PDF documents were typically 100 pages long and could contain both typewritten and handwritten text. These PDF documents were uploaded by users to an SFTP server. Normally there would be 30–40 documents per hour on average, but as many as 100 during peak periods. Since their business was growing, the client expressed a need to OCR up to a thousand documents per day. These documents were then fed into an NLP pipeline for further analysis.

Let's do a Proof of Concept — Our Findings

Time to convert…


Easily create Spark ETL jobs using AWS Glue Studio — no Spark experience required

Image by Gerd Altmann from Pixabay

AWS Glue Studio was launched recently. With AWS Glue Studio you can use a GUI to create, manage and monitor ETL jobs without the need for Spark programming skills. Users may visually create an ETL job by defining its source/transform/destination nodes, which can perform operations like fetching/saving data, joining datasets, selecting fields, filtering, etc. Once a user assembles the various nodes of the ETL job, AWS Glue Studio automatically generates the Spark code for you.
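The generated script follows the familiar Glue PySpark pattern; a simplified sketch of what such a job looks like is below (the catalog database, table and S3 path are placeholders):

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Source node: read a table registered in the Glue Data Catalog (names are placeholders)
source = glueContext.create_dynamic_frame.from_catalog(database="sales_db", table_name="orders")
# Transform node: select and map the fields of interest
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("order_id", "long", "order_id", "long"),
    ("amount", "double", "amount", "double"),
])
# Destination node: write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/curated/orders/"},
    format="parquet",
)
job.commit()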

AWS Glue Studio supports many different types of data sources including:

  • S3
  • RDS
  • Kinesis
  • Kafka

Let us try to create…


Performance Comparison of well-known Big Data Formats — CSV, JSON, AVRO, PARQUET & ORC

Photo by Mika Baumeister on Unsplash

For the past several years, I have been using all kinds of data formats in Big Data projects. During this time I have strongly favored one format over another; my failures have taught me a few lessons. During my lectures I keep stressing the importance of using the correct data format for the correct purpose; it makes a world of difference.

All this time I have wondered whether I am delivering the right knowledge to my customers and students. Can I support my claims using data? Therefore I decided to do this performance comparison.
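As a rough sketch of the kind of test involved, the same dataset can be written out in each format with Spark and timed (the input path is a placeholder, and the Avro writer needs the spark-avro package on the classpath):

import time
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("format-comparison").getOrCreate()
# Load a sample dataset once (path is a placeholder)
df = spark.read.option("header", True).csv("s3://my-bucket/sample/input.csv")
# Write the same data in each format and record how long each write takes
for fmt in ["csv", "json", "avro", "parquet", "orc"]:
    start = time.time()
    df.write.mode("overwrite").format(fmt).save(f"s3://my-bucket/output/{fmt}/")
    print(f"{fmt}: written in {time.time() - start:.1f} seconds")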

Before I start…


Comparison of two well-known engines for optical character recognition (OCR) and Natural Language Processing

Image by Felix Wolf from Pixabay

What is OCR anyway, and why the buzz? Artificial Intelligence (AI) enables entities with Human Intelligence (us) to process data at a large scale, faster and cheaper. Unarguably, a large portion of data is saved digitally, making it easy to read and analyze. However, a significant portion of data is stored in physical documents, both typewritten and handwritten. How do we analyze this category of data? This is where the fascinating technology of Optical Character Recognition (OCR) comes in. Using OCR you are able to convert documents into a text format suitable for editing and searching. …
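As a minimal illustration of what OCR does, here is a sketch using the open-source Tesseract engine through the pytesseract wrapper (the image file name is a placeholder, and Tesseract is only one example of an OCR engine):

from PIL import Image
import pytesseract  # Python wrapper around the Tesseract OCR engine (Tesseract must be installed separately)
# Convert a scanned page image into plain text suitable for editing and searching
image = Image.open("scanned_page.png")  # placeholder file name
text = pytesseract.image_to_string(image)
print(text)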

Manoj Kukreja

Big Data Engineering, Data Science, Data Lakes, Cloud Computing and IT security specialist.
