Being Relevant: How Data Science and Spark Open the Next Frontier for Business Innovation
By Ram Himmatraopet, Founder & CEO, Smarter Data, Inc.
There was a time when, if you had a great product, you ruled the market. You had a great marketing team, you told a good story, you priced it right, and people bought it. You beat the competition. Consumers had neither as many choices as they do today, nor as many great products. And consumers were not as informed or empowered as they are today with mobile, social, and cloud technologies.
In the current era of the empowered consumer, the only sure way to achieve business differentiation is by being relevant. How do you deliver relevance? By constantly learning and adapting to the context, of course. But how do you do it at scale? Through individualized, contextual insight powered by technologies like Big Data and machine learning.
At SmarterData, we work with enterprise customers focused on turning data into insight with sophisticated analytics technologies like data mining, data science, and machine learning. This allows organizations to be relevant in today's complex, fast-changing economy.
‘Interactions with data’ will be a key driver of innovation in the years ahead.
Over the last few years, Hadoop gained popularity for its Hadoop Distributed File System (HDFS) and has been the standard parallel processing platform for batch processing of big data. While it is a solid platform, it is considered difficult to program, which raises the bar for wider enterprise adoption. Performance has been another challenge, as MapReduce jobs usually take a long time to run.
Thankfully, alternative architectures have emerged. One such alternative, Apache Spark, originally developed at the AMPLab at UC Berkeley, is seeing a meteoric rise in community support, bypassing Hadoop MapReduce.
Enterprises are increasingly looking for a Big Data platform that offers flexibility in development tools to match in-house skills, along with support for both batch and real-time processing. While broad-based open source community support is a great indicator of where a technology is headed when deciding where to invest, enterprises are also looking for better commercial support and integration options.
Spark makes it easier than ever to build analytic applications with big data. It has native API support for a wide range of programming languages that data scientists, data engineers, and application developers like to work in: Java, Scala, Python, R, etc. It also has Spark SQL to support existing SQL skills.
If you thought software changed the world, wait until you see how data-driven insight and action will help us solve new problems and reach new heights.
Until recently, software applications mostly leveraged descriptive (characteristics, demographics, etc.) and behavioral (transactional, usage history, etc.) data sets. Now more and more businesses have amassed interaction data (call center transcripts, e-mail/chat, web site clickstreams, machine data, etc.) and attitudinal data (reviews/opinions, surveys, etc.). While this new data is a gold mine of customer insight, it is mostly unstructured and has gone largely unused for business insight for lack of effective natural language processing. With Spark MLlib, you can embed natural-language machine learning algorithms into your business applications to drive more personalized customer interactions.
IBM calls Spark an "Analytics Operating System," and I believe it's an apt description. Spark is an open source in-memory compute engine powering a stack of analytic tools including Spark SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time processing. And the Spark API is much easier to use than MapReduce.
Companies like IBM, Novartis, Airbnb, Yahoo, Intel, Baidu, and Groupon are already using Spark. I believe we will see sustained momentum for Spark going into 2016.
So when should you consider using Spark?
- Big Data jobs that require speed (e.g., up to 100x faster than MapReduce)
- Machine learning algorithms
- Data mining
- Real-time stream processing (live events, fraud detection, real-time queries, etc.)
- Sensor data processing (predictive maintenance, etc.)
- Working with data from a wide variety of sources
- Programming flexibility (Java, Python, SparkR, Scala, etc.)
At SmarterData, we have embraced Spark for its speed, programming flexibility, data source flexibility (including Hadoop), and API flexibility for a variety of analytics, including machine learning. Open source Spark lowers the barrier to entry and makes an excellent choice for Big Data analytics.