
Getting started using Bayes Server and Apache Spark

This section contains examples of how to run Bayes Server 7 on Apache Spark. This page explains how to set up a Spark project that uses Bayes Server, and how to run it on a cluster.

The example code provided is written in Scala, but you can just as easily use Spark with the Bayes Server Java API.

While the Bayes Server .NET API fully supports distributed processing, it has not been tested on Spark.

The Common page contains reusable helper code that makes it easier to use Bayes Server on Apache Spark; it can be copied into your project.


Getting started

The following instructions are aimed at getting up and running with Bayes Server and Apache Spark.

The instructions assume the use of Scala, although the steps will be very similar if you are using Java instead.

  • Create a new Scala project in your favorite IDE.

If you are using IntelliJ IDEA, you may first need to install the Scala plugin. Other IDEs may also require a plugin to be installed.

SBT users: If you are using SBT to manage your dependencies, since the Bayes Server jar is not in a public repository, simply copy it into a folder called lib in your project. SBT will automatically identify any jars it finds in the project lib folder as dependencies. They will also be included if you later build an assembly/fat jar.
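For reference, a minimal build.sbt for such a project might look like the sketch below; the project name and version numbers are placeholders, and the Scala version should match the one your Spark distribution was built against:

    // build.sbt - a minimal sketch; the name and versions are illustrative placeholders
    name := "bayesserver-spark-example"

    version := "0.1"

    scalaVersion := "2.10.6" // match the Scala version of your Spark distribution

    // No entry is needed for the Bayes Server jar itself: SBT automatically treats
    // any jar placed in <project root>/lib as an unmanaged dependency.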

Maven users: If you are using Maven to manage your dependencies in a pom file, you will need to follow the standard approach to adding Maven dependencies that are not in a public repo. For example see local-maven-dependencies.

  • Add Apache Spark as a dependency. To do this, search for the following entry in the public Maven repositories (e.g. search maven):

    • Group id: org.apache.spark
    • Artifact id: spark-core (select the entry ending in _2.10 if you are using Scala 2.10, _2.11 for Scala 2.11, and so on)

Most repository search tools allow you to select the format to copy. For example, if you are using Scala + SBT, select that format and copy the dependency text (e.g. libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0") into your build.sbt file.

If you are running your Spark project on a cluster, the recommended approach is to build an assembly/fat/uber jar. This makes life easier when running your project on the cluster via spark-submit. Note that since Apache Spark will already be installed on your cluster, you can reduce the size of your assembly jar by giving the spark-core dependency the provided scope. In SBT simply append % "provided" to the dependency, and in Maven set the scope to provided.
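For example, in build.sbt the spark-core dependency with the provided scope might look like the following sketch (the Scala suffix and Spark version are illustrative):

    // Spark is already installed on the cluster, so mark it as provided
    // to keep it out of the assembly jar
    libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided"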

  • Copy the code on the Common page into a new file in the following location:

    <project root>/src/main/scala/com/bayesserver/spark/core/BayesSparkDistributer.scala

  • Copy the code on the Mixture model (learning) page into a new file in the following location:

    <project root>/src/main/scala/com/bayesserver/spark/examples/parameterlearning/MixtureModel.scala

  • Create a new file in the following location (or the equivalent location under your project namespace):

    <project root>/src/main/scala/Main.scala

    And add the following code to Main.scala:

import com.bayesserver.spark.examples.parameterlearning.MixtureModel
import org.apache.spark.{SparkContext, SparkConf}

object Main extends App {

  val licenseKey: Option[String] = None // use this line if you are not using a licensed version
  // val licenseKey = Some("license-key-goes-here") // use this line if you are using a licensed version

  val conf = new SparkConf(loadDefaults = true).setAppName("Bayes Server Demo")

  conf.getOption("spark.master") match {
    case Some(master) => println(master) // master has already been set (e.g. via spark-submit)
    case None => conf.setMaster("local[*]") // master has not been set, assume running locally
  }

  val sc = new SparkContext(conf)
  MixtureModel(sc, licenseKey)

}
  • Compile and run in your IDE to test.

If you have used the provided scope on your spark-core dependency, you may need to remove it when running in your IDE.

Running on a cluster

  • Package your project as an assembly/fat jar. If you are using SBT, you can use the sbt-assembly plugin and call sbt assembly instead of sbt package; Maven has various equivalent options.
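With SBT, the plugin is added in project/plugins.sbt; a sketch is shown below (the version number is illustrative, so check the plugin website for a current release):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")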

If you receive deduplicate / mergeStrategy errors when building an assembly jar with SBT, you can find information on merge strategies on the sbt-assembly plugin website.
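As a rough sketch, a build.sbt merge strategy that discards duplicate META-INF entries (a common cause of these errors) might look like the following; the exact syntax depends on your sbt-assembly version, so treat it as a starting point only:

    // build.sbt - example merge strategy for duplicate files (sbt-assembly 0.14.x syntax)
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case x => MergeStrategy.first
    }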

  • The examples above use hard-coded data. When running on a cluster you will need to update the Spark code to read from your distributed data source, such as HDFS or S3.

See the DataFrame page for an example of using DataFrames with Bayes Server, or the sketch below for reading data with the plain RDD API.
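For example, the hard-coded data could be replaced by an RDD read from HDFS; the path, delimiter and column layout below are hypothetical and would need adapting to your own data:

    // Hypothetical sketch: read comma-separated numeric rows from HDFS into an RDD.
    // The path and parsing logic are placeholders - adapt them to your data source.
    val data = sc.textFile("hdfs:///data/mixture-input.csv")
      .map(line => line.split(',').map(_.toDouble))
      .cache() // the data is reused during iterative parameter learning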

  • Copy the assembly jar onto a node in your cluster.

  • On the same node, run the spark-submit script that is installed with Apache Spark, passing the --jars option with the path to the assembly/fat jar you just copied to the node. You will also need to pass the --master option. See submitting applications for more information, or the spark-submit help.

If you need to run on YARN, you can set the --master flag to yarn. You may also wish to set --num-executors.