Several years ago, while I was still a developer, I was chosen to be part of the prestigious Data Administration group. It was a dream come true, because not many got that chance. While I was preparing myself to join the elite, the Principal DBA gave me some very sound advice:
No one ever cares about all the good things you will do, but the first bad thing will get highlighted at the topmost level… such is the life of an administrator.
To be truthful, the advice did not make too much sense to me at that point…maybe I…
Over the past few years, I have been part of a large number of Hadoop projects. Between 2012 and 2016, the majority of our work was done on on-premises Hadoop infrastructure.
The age of on-premises clusters…
On a typical project we would take care of every aspect of the Big Data pipeline, including Hadoop node procurement, deployment, pipeline development and administration. Back in those days Hadoop was not as mature as it is now, so in some cases we had to jump through hoops to get things done. Lack of proper documentation and expertise made things even more difficult.
At its lowest denomination, a Kafka cluster may be created with a single broker instance. Using a Kafka producer, a data stream can be sent in the form of messages to the Kafka broker. These messages stay on the Kafka broker for a configurable period of time, until a Kafka consumer retrieves and processes them.
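That retention period is set on the broker side. As a minimal sketch, a broker's `server.properties` might include settings like these (the values shown are illustrative, not recommendations):

```properties
# Excerpt from a single broker's server.properties
broker.id=0

# How long messages are retained before deletion (default is 168 hours, i.e. 7 days)
log.retention.hours=168

# Optional size-based cap per partition; whichever limit is hit first wins
log.retention.bytes=1073741824
```

Consumers that fall behind by more than the retention window will simply miss the expired messages, so retention should be sized against the slowest expected consumer.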
We are in an era where tracking, processing and analyzing real-time data is becoming a necessity for many businesses. Needless to say, handling streaming data sets is becoming one of the most crucial and sought-after skills for Data Engineers and Scientists.
For this article I am assuming that you are familiar with Apache…
In a previous article, we discussed how to seamlessly collect CDC data using Amazon Database Migration Service (DMS).
The following article will demonstrate how to process CDC data so that a near real-time representation of your database is achieved in your data lake. We will use the combined power of Apache Hudi and Amazon EMR to perform this operation. Apache Hudi is an open-source data management framework used to simplify incremental data processing in near real time.
We will kick-start the process by creating a new EMR cluster:
$ aws emr create-cluster --auto-scaling-role EMR_AutoScaling_DefaultRole --applications…
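Once the cluster is up, Hudi ingests CDC records through Spark's datasource API, keyed so that updates and deletes overwrite earlier versions of the same row. As a sketch, the write configuration might look like the following; the table, field and bucket names here are assumptions for illustration, not taken from the project:

```python
# Hypothetical Hudi write options for upserting CDC records.
# Field names (customer_id, updated_at, country) are illustrative.
hudi_options = {
    "hoodie.table.name": "customers_cdc",                      # target Hudi table
    "hoodie.datasource.write.recordkey.field": "customer_id",  # primary-key column
    "hoodie.datasource.write.precombine.field": "updated_at",  # latest record wins on collision
    "hoodie.datasource.write.operation": "upsert",             # insert-or-update semantics for CDC
    "hoodie.datasource.write.partitionpath.field": "country",  # partition column
}

# With a SparkSession available (e.g. on the EMR cluster), the write itself
# would look roughly like:
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://my-bucket/hudi/customers_cdc"))
```

The precombine field is what lets Hudi keep only the newest version of a row when a batch of CDC events contains several changes to the same key.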
Over my past 10 years in the Big Data and Analytics world, I have come to realize that capturing and processing change data sets is a challenging area. Through all these years I have watched how CDC has evolved. Let me take you through the journey:
Year 2011–2013: For many, Hadoop was the major Data Analytics Platform. Typically, Sqoop was used to transfer data from a given database to HDFS. This worked pretty well for full table loads, and Sqoop's incremental mode could capture inserts as well.
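For reference, a Sqoop incremental import of that era looked roughly like this; the connection string, table and column names below are hypothetical placeholders:

```shell
# Hypothetical Sqoop incremental import: pull only rows whose order_id
# is greater than the last value imported.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/raw/orders \
  --incremental append \
  --check-column order_id \
  --last-value 0
```

In `append` mode Sqoop only picks up new rows with a higher check-column value, which is exactly why it sees inserts but not updates or deletes.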
But CDC is not only about inserts. …
On a recent project we were called in to create a pipeline with the ability to convert PDF documents to text. The incoming PDF documents were typically 100 pages and could contain both typewritten and handwritten text. These PDF documents were uploaded by users to an SFTP server. There would normally be 30–40 documents per hour on average, but as many as 100 during peak periods. Since their business was growing, the client expressed a need to OCR up to a thousand documents per day. These documents were then fed into an NLP pipeline for further analysis.
Time to convert…
AWS Glue Studio was launched recently. With AWS Glue Studio you can use a GUI to create, manage and monitor ETL jobs without needing Spark programming skills. Users create an ETL job by visually defining its source, transform and destination nodes, which can perform operations like fetching and saving data, joining datasets, selecting fields, filtering, etc. Once a user assembles the various nodes of the ETL job, AWS Glue Studio automatically generates the Spark code for you.
AWS Glue Studio supports many different types of data sources including:
Let us try to create…
For the past several years, I have been using all kinds of data formats in Big Data projects. During this time I have come to strongly favor some formats over others — my failures have taught me a few lessons. During my lectures I keep stressing the importance of using the correct data format for the correct purpose: it makes a world of difference.
All this time I have wondered whether I am delivering the right knowledge to my customers and students. Can I support my claims with data? So I decided to do this performance comparison.
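Even without a full benchmark, the "format matters" point is easy to demonstrate: serializing the same tabular records as JSON (which repeats every key in every record) versus CSV (which writes the header once) already shows a clear size gap. A minimal sketch, using only the standard library and made-up records:

```python
import csv
import io
import json

# The same 1,000 tabular records, serialized two ways.
records = [{"id": i, "name": f"user{i}", "score": i * 0.5} for i in range(1000)]

# JSON: field names are repeated in every single record.
json_size = len(json.dumps(records).encode("utf-8"))

# CSV: field names are written once, in the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name", "score"])
writer.writeheader()
writer.writerows(records)
csv_size = len(buf.getvalue().encode("utf-8"))

print(f"JSON: {json_size} bytes, CSV: {csv_size} bytes")
```

For repetitive tabular data the CSV output comes out considerably smaller, and columnar binary formats such as Parquet push this much further by adding compression and column pruning — which is what the full comparison measures.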
Before I start…
What is OCR anyway, and why the buzz? Artificial Intelligence (AI) enables entities with human intelligence (us) to process data at large scale, faster and cheaper. Unarguably, a large portion of data is saved digitally, making it easy to read and analyze. However, a significant portion of data is stored in physical documents, both typewritten and handwritten. How do we analyze this category of data? This is where the fascinating technology of Optical Character Recognition (OCR) comes in. Using OCR, you are able to convert documents into a text format suitable for editing and searching. …
Big Data Engineering, Data Science, Data Lakes, Cloud Computing and IT security specialist.