This post continues the series on stream processing in big data applications. In the earlier posts I gave an Introduction To Real Time Stream Processing and established some of the Goals Of Real Time Stream Processing In Big Data.
There has been massive development in the Internet Of Things, and as a result big data is growing at a massive scale. IoT generates real time events that are transforming the way companies engage their customers and deliver their services. However, this requires them to deal with an enormous amount of heterogeneous data in motion. Processing this moving data is very different from processing static data: the system has to make quick decisions and deliver insights without much time delay. Failure to do so results in huge losses for the company and its customer base. The real time and large scale characteristics make processing of stream data a challenging task. In this post, we will discuss some of the challenges faced by real time stream processing systems.
1) Heterogeneity Of Data In Stream Processing
Data in real time streaming applications comes from a variety of sources, so data integration has to be carried out before any kind of processing. This is a big challenge for real time streaming applications: they have to deliver high throughput, take care of data synchronization, and minimize delays, all at the same time.
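As a rough illustration of the integration step, here is a minimal Python sketch that maps events from different sources onto one common schema. The source names and field names are invented for the example; real feeds will differ.

```python
# Hypothetical field names per source; real feeds will differ.
FIELD_MAP = {
    "sensor":      {"timestamp": "ts",   "value": "reading"},
    "clickstream": {"timestamp": "time", "value": "payload"},
}

def normalize(raw, source):
    """Translate a source-specific event into one common schema."""
    fields = FIELD_MAP[source]
    return {
        "source": source,
        "timestamp": raw[fields["timestamp"]],
        "value": raw[fields["value"]],
    }
```

In a real pipeline this mapping would sit at the ingestion edge, so every downstream stage sees a single uniform record shape.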
2) Data Quality Assurance In Stream Processing
In big data, useful information commonly comes bundled with useless information. During the cleaning phase, if the filtering granularity is too fine, there is a high chance that some useful information will be filtered out. On the other hand, if the granularity is too coarse, filtering is hardly of any use at all. So, between quantity and quality of data, we have to decide what matters more for a real time streaming application. Another major concern is that data considered non-essential during filtering may become critical after processing. Thus, correlating data accurately, and judging its usefulness and effectiveness, is a major challenge.
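One pragmatic way to hedge the granularity trade-off is to archive filtered-out records instead of discarding them, so data that later turns out to be critical can still be recovered. This split-and-archive approach is a sketch of the idea, not a standard API:

```python
def filter_stream(events, keep):
    """Split a stream into kept and archived events. Records that
    fail the filter are archived rather than discarded, so they can
    be recovered if they turn out to matter post-processing."""
    kept, archived = [], []
    for event in events:
        (kept if keep(event) else archived).append(event)
    return kept, archived
```

The archive side would normally go to cheap cold storage, keeping the hot path fast while preserving the option to revisit "useless" data.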
3) Lack Of Prior Knowledge
Unstructured and semi-structured data is difficult to process, because its schema is not known to the system beforehand. On top of that, it is difficult to gather enough information about the data prior to processing, because it arrives as an endless stream.
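Without a schema up front, a system can only infer one incrementally from the records it has seen so far. A toy sketch of such inference over a sample of the stream:

```python
def infer_schema(sample):
    """Infer a rough field -> type(s) map from a sample of records.
    Fields may carry several types across records, which is exactly
    the ambiguity a fixed schema would have ruled out."""
    schema = {}
    for record in sample:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in schema.items()}
```

Any schema inferred this way is provisional: a later record in the endless stream can always introduce a new field or type.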
4) Timeliness Of Data Processing In Stream Processing
In a big data application, time is of utmost importance. Any unreasonable delay can be very hard to estimate, and it can cost the organization and its user base dearly.
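A basic way to keep delay visible is to measure, per event, the gap between the event's own timestamp and the wall clock at processing time. A minimal sketch:

```python
import time

def latency_ms(event_time_s, now_s=None):
    """Processing delay in milliseconds: wall clock at processing
    time minus the event's own timestamp (both in seconds)."""
    now = time.time() if now_s is None else now_s
    return (now - event_time_s) * 1000.0
```

Tracking this metric per event lets an operator see delays as they build up, instead of discovering them after the damage is done.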
5) Exposure Of Private User Data In Stream Processing
Users leave a lot of data footprints while accessing the internet. While integrating and accumulating information from the web, it is extremely important that private user data is not mishandled. Streaming applications require instant processing of data, but they also require user data to be secure. Providing both at the same time is a big challenge for a data streaming application.
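One common mitigation, sketched here with hypothetical field names, is to pseudonymize identifying fields with a one-way hash before events reach shared processing stages:

```python
import hashlib

def pseudonymize(event, pii_fields=("user_id", "email")):
    """Replace PII fields with a truncated one-way hash so downstream
    stages can still group by user without seeing raw identifiers.
    The field names here are illustrative."""
    out = dict(event)
    for field in pii_fields:
        if field in out:
            digest = hashlib.sha256(str(out[field]).encode()).hexdigest()
            out[field] = digest[:16]
    return out
```

Because hashing is cheap and stateless, it fits the instant-processing constraint better than heavier anonymization done after the fact.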
6) Private vs Public User Data
Some user data can be essential for building aggregated metrics over public data. Finding the balance between what is private and what is public data is a challenge that all data sensitive applications have to face.
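A simple balance point, sketched below, is a k-anonymity-style threshold: publish an aggregate only when at least k users contribute to it. The field name and threshold are illustrative:

```python
from collections import Counter

def publishable_counts(events, key, k=5):
    """Aggregate per-group counts, but suppress any group with fewer
    than k contributors, so no small group exposes individuals."""
    counts = Counter(event[key] for event in events)
    return {group: n for group, n in counts.items() if n >= k}
```

Groups below the threshold simply never appear in the public output, trading a little completeness for privacy.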
7) Visibility In Stream Processing
Visibility involves presenting the results of an application in the most intuitive and clear way. Real time data visualization requires a lot of CPU and GPU resources, which is a serious challenge for a big data streaming application.
8) Feedback Design In Stream Processing
With a proper feedback design, users can keep track of their processes. Since the internals of most streaming data applications are extremely complex, almost a black box for the users, it is very difficult to provide good feedback to the users for any kind of post-processing analysis.
9) High Complexity Of The System in Stream Processing
Data in streaming applications is extremely heterogeneous. This, along with strict real time and high throughput constraints, makes both the hardware and software of a streaming application very difficult for a normal user to understand. Also, the data is huge and extremely difficult to reproduce, which makes testing the system challenging.
10) Continuous Evolution Of The System In Stream Processing
Data in traditional database systems is stable, but data in streaming applications is continuously evolving. This poses a serious challenge when testing the system: the test results not only have to be produced extremely fast, they also have to be accurate and represent the correct state of the system at every point, while the system is continuously changing.
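One way to make such a moving target testable is to record a slice of the stream and replay it deterministically, so the same inputs always reproduce the same results. A minimal sketch of the idea:

```python
def replay(recorded_events, process):
    """Re-run a recorded event log through a processing function.
    Replaying a fixed log makes results reproducible even though
    the live stream itself is continuously changing."""
    return [process(event) for event in recorded_events]
```

Recorded logs turn an ever-changing stream into a fixed fixture, which is the closest a streaming system gets to the stable data a traditional database test enjoys.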