The Fast Data app simulates real-time click stream processing. Click events are ingested into VoltDB at a high rate, then cleaned up and persisted into a data warehouse for historical analysis. Segmentation information is calculated in data warehouse and stored back into VoltDB. VoltDB uses the information to segment real-time events for per-event decisioning.
The click events can come from persistent queues like Apache Kafka. For simplicity, this sample app generates random events at a constant rate. Each click event contains basic information like source IP address, destination URL, timestamp, HTTP method name, referral URL, and the user agent string. More information can be included in a real-world use case.
The stream is ingested into VoltDB, cleaned up, then exported into a data warehouse like Hadoop for long term persistence. The data warehouse runs machine learning algorithm on the full historical dataset to segment the click events into clusters periodically. Each cluster is represented by its center. The cluster centers are sent back into VoltDB.
VoltDB uses the clustering information to further segment new click events into the corresponding cluster at real-time. VoltDB can use this to make per-event real-time decisions like what advertisement to show a user, or whether a user should be blocked for spamming.
Events are aged out of VoltDB after they are persisted in the data warehouse. This limits the dataset inside VoltDB to only include relatively hot data.
The example includes a dashboard that shows click stream cluster distribution, top users and top visited URLs in a moving window. All of these information is calculated from materialized views on relevant source tables.
The code is divided into projects:
See below for instructions on running these applications. For any questions, please contact firstname.lastname@example.org.
Before running these scripts you need to have VoltDB 4.8 or later installed, and the bin subdirectory should be added to your PATH environment variable. For example, if you installed VoltDB Enterprise 4.8 in your home directory, you could add it to the PATH with the following command:
This demo requires a Cloudera Hadoop installation, whose configured services also include Spark. The demo consists of a VoltDB server writing export data to files stored in Hadoop, and a set of scripts, and programs that collect the exported data, compute k-means clusters on the collected data, and store the computations back to VoltDB.
On the server where VoltDB is installed set the environment variable
WEBHDFS_ENDPOINT to a WebHDFS endpoint that matches the following pattern
Download this archive and unpack it on an Hadoop node
$ tar -jxf fastdata-kmeans.tar.bz2
Change your working directory to
fastdata-kmeans and run the
script when you to want to process exported data from VoltDB (see
Demo Instructions section bellow)
$ cd fastdata-kmeans # # Assuming you are exporting to export/fastdata, and VoltDB is # running on volthost $ ./compute_clusters.sh export/fastdata volthost
harvest.pigpig script to write the harvested data into a Parquet database
load.pigscript that reads the K-Means computation Parquet database, and writes its content back to VoltDB by utilizing the VoltDB hadoop extensions
This example also requires Vertica installed on the
same machine or a machine that the VoltDB machine has access to. A Vertica
database must be created with the username dbadmin with no password. Once
Vertica is running, set the environment variable
VERTICAIP to point to the
Vertica machine on the VoltDB machine. For example, if your Vertica is
running on 192.168.0.1, run the following command on the VoltDB machine:
VoltDB uses JDBC to export click events to Vertica. Vertica's JDBC driver
must be present in the classpath before starting VoltDB. If your Vertica is
/opt/vertica, you can find the JDBC driver in
/opt/vertica/java/lib/. Copy the JAR file in that directory to the VoltDB
machine and put it in the
lib/extension sub-directory in your VoltDB
To run the K-means clustering algorithm in Vertica, it requires the R language package to be installed. Please follow the instructions in Vertica documentation to install the R language pack.
vertica sub-directory in the example to the vertica machine. This
directory contains Vertica extensions for K-means clustering and loading data
back into VoltDB. The following instructions assume that the directory is
Start the web server
Start the database and client
# for Hadoop run $ ./run.sh hadoop_demo # while for Vertical run $ ./run.sh demo
Open a web browser to http://hostname:8081
For Vertica run the
updatemodel.sh script on the Vertica machine to run
the K-means clustering algorithm on the data in Vertica. You can run this
command periodically to update the cluster model in VoltDB.
To stop the demo:
Stop the client (if it hasn't already completed) by pressing Ctrl-C
Stop the database
$ voltadmin shutdown
Stop the web server
$ ./run.sh stop_web
You can control various characteristics of the demo by modifying the parameters passed into the LogGenerator java application in the "client" function within the run.sh script.
Speed & Duration:
--duration=120 (benchmark duration in seconds) --ratelimit=20000 (run up to this rate of requests/second)
Before running this demo on a cluster, make the following changes:
On each server, edit the run.sh file to set the HOST variable to the name of the first server in the cluster:
On each server, edit db/deployment.xml to change hostcount from 1 to the actual number of servers:
<cluster hostcount="1" sitesperhost="3" kfactor="0" />
On each server, start the database
On one server, Edit the run.sh script to set the SERVERS variable to a comma-separated list of the servers in the cluster
Run the client script: