Behind the Curtain An Introduction to Hero_Flow
SalesHero was created by the founding team of Datameer, the award-winning market leader in big data business intelligence that is used by companies such as Citi Bank, Visa, Barclays, Sprint, Deutsche Telekom and Dell (among others). Datameer was built from the ground up and in just five years secured a place in Gartner’s BI and Analytic Magic Quadrant. Needless to say, this isn't our first rodeo. Now we're set to build the first modern AI automation platform.
Our engineering team combines almost a century of big data and AI experience and its members are founders, co-founders and very early contributors of technologies such as JBoss (acquired by RedHat for $900M), Hadoop (now considered a $30 billion market), Bixo (used by EMI Music, ShareThis, Bebo), Katta (used by Apple, Huawei, Chinese Telecom, FBI), ZkClient (used by Apache Kafka, Facebook’s Presto, ATT, Salesforce) and of course, Datameer.
As always, when we started developing any new technologies the first question we were asked was, “Why build another platform if all of these others already exist?” The short answer is that we believe the currently available technology is crumbling and cannot meet the requirements of our rapidly accelerating tech-focused world that is increasingly being driven by artificial intelligence. We need a modern business automation platform.
If autonomous vehicles are already a reality, then how can AI serve businesses? We didn’t see anything on the market that was addressing this issue so we decided to build a new operating system to support scalable autonomous business processes. Introducing Hero_Flow.
The long answer
In the 1990s, client-server architecture and relational database management systems (RDBMS) replaced mainframes for data storage and processing. More than a decade later the digital and mobile revolution b-tree index-based database systems could not scale due to the volume, velocity and variety of data. By implementing the technology described in Google’s File System and MapReduce research papers, Apache Hadoop sparked the big data revolution Hadoop and successor systems such as Spark and Flink freed up data from static schemas and SQL (invented in the early 1970s) with the notion of ‘Schema on Read’ that became increasingly popular to solve major challenges such as data schemas which are constantly evolving. This shortened the time to insight, handled semi-structured-and-unstructured data, and of course, could work with much larger data sets than before. However, the architecture of these currently popular data systems is showing its age. Apache Hadoop’s first release was in 2007, Spark in 2009, Kafka in 2011 and Apache Flink in 2014.
There hasn’t been any major innovation in the last decade – only incremental improvements (we shouldn’t consider streaming [Kafka, Storm, Flink Streams] as an innovation since streaming message queues have been around for decades).
The major challenge with these systems is that they are architected for big data storage, prep and analytics. That means they optimize for a few batch or streaming jobs. Try serving millions of users from a Hadoop file system or span millions of micro jobs per hour in a Spark cluster and it will fail miserably. A workaround for this challenge is the Lambda architecture in which two separate and parallel maintained, monitored and integrated systems are implemented – one for slow big data batch processes and one for small, fast data. In fact, our founder was the first to promote this architecture a decade ago at the first Hadoop Summit in 2008.
However, the workload profile of an AI automation platform is very different from traditional Hadoop, Kafka or Flink workloads. The workload profile is composed of continuous big data jobs to (re) train deep learning models but doesn’t have any downtime when deploying model updates. It can serve millions of requests per second through business process-defined micro streams to make business-critical decisions in real time.
The Need for Speed:
Traditional big data systems are optimized for throughput by taking advantage of sequential data access (even SSD or RAM is magnitudes faster than random data access). This advantage should be used to prep data and train deep learning models, not for business operations. Today’s customers who write complaint emails don’t send them in sequential order by their hash user ID. In order to serve AI business automation workloads, the platform needs to be incredibly fast to span micro streams triggered by events. For example, imagine a sales rep is creating a quote for a customer on the phone. The AI platform needs to make real-time cross-and-upsell recommendations – not 10 min after the customer has hung up because it took too long as a batch job. What is needed is a fast data platform.
Mission Critical Reliability:
Traditional big data systems weren’t architected for continuous business-critical operations. It took years for Hadoop to implement a NameNode failover to avoid a single point of failure (SPOF). Still, today, Hadoop, Spark, Flink and Kafka rely on a centralized data structure for state management. While these centralized data structures are now distributed and synchronized over a set of machines using Apache Zookeeper it still has not fundamentally solved the SPOF problem but only made it a little more reliable by distributing it over a set of machines. However, this solution doesn’t solve complex challenges such as split brain (network) situations that are common in cloud and multi-data center deployments.
In a world in which we want to automate business processes through AI we need a mission-critical, reliable platform. A platform that is self-healing, has no centralized state and smartly resolves state conflicts instead of relying on centralized state management.
Combination of Data-flow and Business Work-flow:
To automate business processes with AI, business workflows need to be driven through AI models that were trained on historical data and that are continuously improved and refreshed. Traditional business process automation platforms are modeled manually and are less powerful and efficient than AI models that are trained on historical data. Individual steps such as updating a customer address or creating a quote are done manually by the human workforce.
A modern AI automation platform should drive and automate these steps through AI. The human workforce will remain a critical component in the workflow, but the real goal is to automate as many steps with AI – from data entry to reaching out to customers at the perfect time – as possible. At this point there is no platform in the market that combines a fast data platform with AI to drive business workflows.
In distributed system research, one of the most discussed challenges is the CAP Theorem. The CAP Theorem states that among availability, consistency and partition tolerance, a distributed system can only achieve two out of three goals.
Availability means that the system has high availability and can deal with failure of part of the distributed system and will continue to provide full functionally to clients. Consistency means that all data in the system at any given time is consistent for all clients. And, partition tolerance means that when a system is distributed between a data center in the US and Europe, even if the connection between the continents is disturbed, the separated systems will continue to operate.
Hero_Flow: The ‘fast data’ AI business automation platform
Hero_Flow was developed from the get-go as a new business operating system combining fast data, AI and business process management capabilities to form a new AI business automation platform optimized for workloads in a modern and AI-focused world.
The SalesHero team worked for two years on getting the fundamentals right to ensure a future-proof platform that can scale for any size organization in modern deployment scenarios. Below are some highlights of the platform.
Hero_Flow is implemented from the ground up as a Reactive System. That means four major goals drove all architecture decisions.
- The platform needs to be responsive to client requests, providing rapid and consistent response under any circumstance. Responsiveness is the cornerstone of usability and consistency is critical to achieving user confidence. Further, this also implies a fail-fast approach to identify and resolve errors quickly.
- The platform needs to be resilient to sustain infrastructure, hardware, network, user and code errors as it maintains responsiveness. This resiliency is achieved through loosely coupled, contained components that are replicated and failure management is implemented by delegation. This allows the system to maintain operations even if a component fails as the failure is managed and dealt with by an external higher component, for example, by restarting a component on a different cluster node.
- The platform has to be elastic, meaning that even under load peaks it maintains responsiveness and therefore is able to expand or contract based on workloads. This implies that the system can’t have bottlenecks and is designed in a way that everything can be partitioned and any workload can be distributed over an elastic group of cluster nodes. The system needs to provide relevant life performance measures to support predictive and reactive scaling algorithms.
- The platform is completely asynchronous message-driven. A message-driven approach in comparison to blocking-method calls allows a loose coupling and location transparency for all system components and therefore enables the ability to distribute the message processing to an elastic group of processors. It also allows for better measurement and understanding of internal loads as well as increased resiliency by avoiding deadlock.
Historical big data platforms such as Hadoop or Spark have a master-worker architecture, meaning that it is centralized (distributed over a few nodes with Zookeeper) state management and communication architecture. Especially in very large clusters centralized communication results in bottlenecks in critical situations such as when a large number of nodes in the cluster needs state update (e.g. after a network failure).Hero_Flow avoids this challenge by implementing a gossip communication protocol that is modeled after the way social networks disseminate information. Instead of a central entity communicating with thousands of nodes, Hero_Flow’s gossip protocol allows each node to communicate with a random set of other nodes to quickly spread information and scale while also maintaining resiliency by operating without a centralized master entity and overcoming network issues and node failure
Eventual consistency / conflict-free replicated data types
As described in the CAP Theorem, a distributed system can only achieve two out of three goals. A novel workaround to avoid system partitioning due to network failures is the concept of eventual consistency. Instead of keeping mission-critical system data centralized, the data is written to many nodes directly and further replicated in the cluster using the gossip protocol. Eventually, over time, all nodes receive the data update – even nodes that re-join a cluster after a network partitioning. There is no maximum number of nodes the client can read the data from and with more nodes the likelihood of getting outdated data decreases.
The client also has a number of different approaches it can take to avoid inconsistency. A common approach is the quorum protocol such as Paxos in which consensus is reached via quorum. However, this approach does not scale very well in very large clusters. For example, that’s why Zookeeper runs on a limited number of nodes in systems such as Kafka. Hero_Flow uses the more scalable and performant approach of conflict-free replicated data types for distributed state management. The idea is that a data type that also contains its history can be merged without conflict. To communicate the concept with an oversimplified example, imagine a HashMap that stores each update and its timestamp. The client is then able to merge the data it read from a number of nodes even if the data in different nodes had different updates and are unaware of these updates. A popular implementation of this concept is the Git distributed version control system that is able to partition and merge branches.
No single point of failure or centralized state management
By using the gossip protocol, eventual consistency and conflict-free replicated data types, Hero_Flow avoids any single point of failure and implements a fully decentralized state management. This means that in case of a cluster partition or hardware failure the split-off cluster segment is able to autonomously continue operation and can later merge back into the cluster without any conflict. Hero_Flow can operate in very large clusters distributed over many racks and data centers to achieve high availability with built-in disaster recovery mechanisms.
Scalable Actor system
In order to achieve our goal of implementing a reactive (responsive, resilient, elastic and message-driven) system, Hero_Flow was built as an Actor System in which each component is an actor that communicates with other actors through asynchronous messages. The concept of an actor model was first formalized by Carl Hewitt in the early 1970s and was made popular through the programming language Erlang used in massively scaled telecommunication network equipment. Hewitt noted, “The actor model was inspired by physics, including general relativity and quantum mechanics. It was also influenced by the programming languages Lisp, Simula and early versions of Smalltalk, as well as capability-based systems and packet switching.”
Over the years a large amount of research from MIT, Caltech and Stanford improved the initial concepts to make actors the most promising platform for highly distributed systems today.
Distributed – Java 9 streaming
We believe in open standards so it was easy to select which data processing mechanism to use. Java ™ 9 introduced Reactive Streams that perfectly matched our design goals (and even helped inspire the name for our system – java.util.concurrent.Flow). Initially formalized in the Reactive Streams initiative started by companies such as Netflix, Java 9 Reactive Streams formalized a universal standard to process data through a set of processors connected by a publish-subscribe mechanism. A unique advantage is the non-blocking implementation of back pressure where in contrast Kafka implements back pressure through a blocking queue. We extend the Reactive Streams to operate in a distributed, highly-scalable environment adding data partitioning capabilities, failure tolerance and speculative execution capabilities.
Another unique advantage of the Hero_Flow platform is the possibility to pin certain flows or processes for hardware groups. For example, it’s possible to pin a stream to a set of nodes with GPUs to accelerate Google Tensorflow execution.
Dynamic partitioning / elastic auto scaling
Compared to traditional big data platforms, Hero_Flow is architected from the ground up to work in an elastic environment with a continuously changing number of nodes (system loads vary and the cluster can react by expanding and contracting). This means the system can dynamically partition data to parallelize processing in a number of different ways. Systems such as Hadoop or Kafka are limited in this capability (for example, Kafka needs to pre-partition topics at the start and can only re-distribute those specific, pre-defined partitions). With Hero_Flow, it’s possible to choose from different strategies (for example, partition data based on currently available cluster resources, defined partitions, increased cluster resources, or dynamically expand and contract using a work-pulling pattern).
Performance and utilization
Hero_Flow is built for responsiveness, meaning that there is a sophisticated thread and concurrency management to achieve maximum throughput with minimal CPU context switching. The Hero_Flow runtime on every node is minimal and is optimized to use as little CPU, memory and IO as possible. On a single core of a 2.9 GHz Intel Core i7 CPU, Hero_Flow achieves 1.9 million 100 byte messages per second without any IO or CPU boundary. Even more, Hero_Flow uses the fastest in-class object serializer that outperforms Google Protobuffer for complex objects by more than 40 percent without needing to manually define data types and generate Java classes. To optimize hardware utilization, Hero_Flow can pick nodes in the cluster based on equal CPU or memory selection profiles and deploy stream partitions on nodes that had lower-than-average CPU utilization over the last few minutes. Finally, in order to react to business events, Hero_Flow is capable of spanning new streams within a few hundred milliseconds.
To ensure responsiveness, Hero_Flow expands the cluster size by interacting with AWS scale groups, Mesosphere, Hadoop Yarn or Kubernetics, predictively or reactively (for example, should the average CPU load exceed defined thresholds).
With extensive experience in the financial services and telco industries, we understand the complex security needs of a data platform to fulfill internal and regulatory requirements. As some implementations need to be optimized for security and some for performance, Hero_Flow allows granular configurations of individual security capabilities, including:
Hero_Flow security capabilities include:
- Pluggable encryption algorithms and strength
- Encryption of data from source to stream
- Encryption of data from stream to sink
- Encryption of any temporary written data (e.g. when data needs to be cached for a reduce-side join)
- Encryption of all communication between the node
- Encryption of data from command line or API clients to the cluster
- Https access for the web-based admin console
- Integration into authentication providers such as OAuth, LDAP or Microsoft Active Directory
- Integration into Java Authentication and Authorization Service (JAAS) with connection to Kerberos on request
- Audit and access logs
- Encryption key rotation (on request)
Advanced metrics & monitoring
To run a business-critical platform in a production environment, advanced monitoring and alerting is a non-negotiable for every dev ops team. Hero_Flow provides advanced metrics and flexible monitoring integration with a very lightweight footprint on the platform. Hero_Flow provides node health and node utilization metrics, in addition to throughput, back pressure and health metrics for every function, source and sink currently running in the environment. This allows instant pinpointing of cluster hardware, network, performance or custom code issues. A broad set of monitoring system connectors to collect, visualize and create alerts is available.
- UDP based StatsD
- HTTP(S) REST API
- Datadog and others (on request)
Hero_Flow comes with an easy-to-extend connector framework to integrate any kind of data store to read or write data.
- Office 365 / Outlook
- SAP NetWeaver
- JDBC (Oracle, MS SQL Server, Teradata, DB2, MySQL, Postgress, AWS Redshift, SnowFlake etc)
- Hadoop Distribute File System
- Apache Kafka
- AWS Dynamo DB
- AWS S3
- AWS SQS
- Google Cloud Pub/Sub
- Azure Storage Queue
- Mainframe (on request)
Hero_Flow tightly integrates with Google Tensorflow to train and execute a wide range of deep learning models. Existing models can simply be installed and used as a function in any data or business process workflow and take advantage of the enterprise readiness of the Hero_Flow platform. Hero_Flow can be used to train pre-defined (e.g. in Python) models or Hero_Flow’s AI design studio can be used to visually create deep learning models. In addition, the platform comes with three advanced deep learning-based engines:
Hero_Flow’s recommendation engine is commonly used for cross and upsell, churn prediction and best next step suggestions (or similar use cases such as content recommendations).
Our engine shows best-in-class performance using the latest innovations in the deep neuronal network space. It provides a high degree of flexibility of data can be used to create user and item features. A customizable representation function can be used to convert any kind of data, including behavior, time series and low-dimensional vectors. The recommendation model is then trained with a customizable loss function depending on the problem space.
The Hero_Flow intent detection engine can identify the intent of written communication such as emails, text messages, or online forms and posts such as the intent to schedule a meeting, return a product or change an address. Using novel concepts from transfer learning, the Hero_Flow intent detection engine advantage reaches high accuracy with small dataset training. The underlying sequence-to-sequence model is language independent and therefore can be trained for most languages.
Dark Data Extraction Engine
Hero_Flow’s dark data extraction engine can be trained to extract structured data from unstructured data. Commonly used for extracting names, job titles, phone numbers, address or product data from text conversations, it also can be extended to first convert (e.g. image data to text) and then extract meaningful structured data. The language-independent model can train a wide set of named entities and use a performance-optimized sequence-to-sequence deep neuronal network.
Bringing humans and AI together for maximum productivity
The integration of AI and people is one of the most difficult challenges. We’ve spent a lot of time thinking about this and experimenting with different approaches to ensure a positive and, most importantly, engaging experience for users. Our approach was pioneered by Prof. Luis von Ahn, the inventor of Captcha, pioneer of crowdsourcing and founding CEO of Duolingo. In his famous Google talk discussing the ESP Game he introduces the concept of games with purpose.
Hero_Flow allows at any given stage of a data or business workflow to integrate human validation or motivation processes. For example, Hero_Flow can extract structured data from an email that needs human validation. It does so by personifying the user experience as an AI assistant and creates a gamified, engaging experience to motivate the user to help with validation. Whenever possible, Hero_Flow will integrate with existing business games such as achieving quota in sales or being recognized as the best sales rep. SalesHero’s AI assistant Robin presents itself as an ally to the sales rep by automating time-consuming and repetitive tasks and supporting the rep through augmented intelligence such as next-step or cross-and-upsell recommendations.
Hero_Flow offers a platform that can help with individual use cases but can also scale to serve the business operation needs of large organizations. As data flows through data preparation, analytics and AI micro-service functions, Hero_Flow connects data sources, business systems, business processes and people to form an intelligence fabric that automates business processes and augments decision making.
As Hero_Flow is deployed throughout a business, this fabric of AI business automation can be iterated on, step-by-step, starting with high-impact, low-implementation effort processes. By continuously adding more data sources and leveraging feedback loops, the precision of the AI models will significantly improve.
Deployment cloud / on-premise
The Hero_Flow platform can be deployed in various ways. Platform nodes are simple Java processes and can, therefore, run on top of popular cluster resource management frameworks such as Hadoop Yarn or Mesos. But, the platform can also be native or containerized and therefore can be run and managed as an AWS scale group or Docker container in systems such as Kubernetes.
A set of Java APIs is available to implement custom sources, sinks and functions. Building these functions is easy as we provide single JVM runtime to test customizations. Developer training and consulting are available on request to ensure teams are up and running quickly.
Data_Flow Design Studio (beta)
The Data_Flow Design Studio allows admins to visually create data flows. After selecting a previously set up data source, a repository of data preparation, manipulation and machine learning functions are available to point-and-click together simple or complex data flow streams that are eventually streamed into a sink. Streams can be linear but also support circular structures.
If a stream requires human validation (e.g. approve a recommended CRM update), only a simple step of adding a validation function to the data flow is required. The data is presented as a to-do item to the end user for validation on a regular cadence. The data stream is continued as soon as an end user validates and approved an item. Alternatively, if a stream only provides recommendations such as best next-step or cross-and-upsell recommendations there is a recommendation sink available that delivers the recommendation on a regularly defined schedule to the end user.
AI_Design Studio (alpha)
Hero_Flow also provides an AI_Design Studio that allows every novice to visually create and test deep learning models. A wide variety of models is such as RNN, LSTM, Seq2Seq are already support but we continuously expand this list.
Feature engineering is done via streams designed in Data_Flow Studio. Once designed, these streams can be used to train or later feed a model during production. When a model is designed, trained and validated it can be exported as a function and moving forward is available as a custom point-and-click function in the Data _Flow Studio.
It is also possible to retrain a model and automatically update it with that function. This allows ongoing improvement to the AI models used in production.