Recently, I had a series of discussions with a customer in the financial domain to help them choose a framework or product for real-time processing. Most of the major software vendors in the big data engineering space offer multiple products that cater to similar use cases. It is not easy to gravitate to one solution, say Spark, Storm, Azure Stream Analytics in Cortana, or InfoSphere Streams in the IBM stack. Stream processing comes with a lot of challenges, and just using a tool or framework will not solve them. A real-time analytics ecosystem needs to be designed that supports complex event processing, scales with the volume and velocity of data, integrates with the EDW and big data systems, and provides an analytics suite for in-process data discovery, automated alerts, etc.
What are the top 5 challenges in designing a real-time stream analytics solution? Below are my considerations.
- Data arrives at very high velocity and in great variety: Data comes from a large number of sources such as devices, social media, IT logs, DB transactions, etc. How can we process these continuous, parallel streams of data arriving at high speed? In this "data waits for no one" situation, how do we arrive at a feasible time window for analysis and coordinate it with the streaming engine's latency? How can the right event sequencing be achieved?
- Real-time alerts and predictions: How can we enable alerts based on anomalies and thresholds? How can we create and choose a real-time predictive model? What type of alerting mechanism needs to be chosen (email, SMS, etc.)? How can alerts be enabled on query results and represented in a real-time dashboard?
- Failover and recovery mechanism: How can we assure an "at least once" (or "minimum once") processing strategy without losing results, and without duplicating them when events are replayed after a failure? What failover and recovery strategy needs to be applied?
- Integration with enterprise systems, including big data systems: How can we achieve a cohesive environment by integrating the real-time analytics system with the existing EDW and big data systems? How can real-time analytics leverage existing high-volume batch analysis and vice versa? What modifications will be required in existing business workflows when incorporating real-time analytics?
- Security: How can we extend enterprise-level security policies to streaming solutions, including data streams, applications, data stores, roles and fields? How can we secure access to the streaming platform, covering both software and hardware nodes?
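To make the first three questions a little more concrete, here is a minimal sketch in plain Python of one possible approach. It is not tied to Spark, Storm or any other engine. It assigns each event to a fixed (tumbling) window based on the event's own timestamp, skips events that are delivered more than once, and raises an alert when a window total crosses a threshold. The event fields (`id`, `source`, `event_time`, `amount`), the 60-second window and the threshold value are all assumptions made up for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60       # hypothetical tumbling-window size
ALERT_THRESHOLD = 1000.0  # hypothetical per-window limit


def window_start(event_time, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % size)


def process(events, threshold=ALERT_THRESHOLD):
    """Aggregate amounts per (source, window) and collect threshold alerts.

    Each event is a dict with the hypothetical fields
    id, source, event_time (seconds) and amount.
    """
    seen_ids = set()             # ids already processed, to drop redeliveries
    totals = defaultdict(float)  # (source, window start) -> running total
    alerted = set()              # windows that have already raised an alert
    alerts = []
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate from an at-least-once source, ignore it
        seen_ids.add(event["id"])
        # Window by the event's own timestamp, so arrival order does not matter.
        key = (event["source"], window_start(event["event_time"]))
        totals[key] += event["amount"]
        if totals[key] > threshold and key not in alerted:
            alerted.add(key)
            alerts.append(key)
    return dict(totals), alerts


events = [
    {"id": "e1", "source": "atm-1", "event_time": 5, "amount": 600.0},
    {"id": "e3", "source": "atm-1", "event_time": 65, "amount": 100.0},
    {"id": "e2", "source": "atm-1", "event_time": 50, "amount": 500.0},  # late
    {"id": "e2", "source": "atm-1", "event_time": 50, "amount": 500.0},  # replay
]
totals, alerts = process(events)
print(totals)  # {('atm-1', 0): 1100.0, ('atm-1', 60): 100.0}
print(alerts)  # [('atm-1', 0)]
```

A real deployment would also have to bound the set of seen ids, decide how long to wait for late events before closing a window, and persist this state so that it survives a failover. Those are exactly the design decisions the questions above are about.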
This post is full of questions! I'll try to answer them in the coming posts. Meanwhile, let me know your thoughts.