- Save the Scala code for the UserEcho function to the file UserEcho.scala.
- Create the directory structure:

    mkdir -p ~/"directory"/src/main/scala/"packageName"
    mkdir -p ~/"directory"/lib
- In the directory ~/"directory", create the file build.sbt with this content:

    name := "UserEcho"
    version := "1.0"
    scalaVersion := "scala_version"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "spark_version"
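As a concrete illustration, the build.sbt above can be written with example version numbers filled in. The values 2.10.5 and 1.6.0 are illustrative only (later steps in this procedure reference Scala 2.10 and Spark 1.6.0; substitute the versions installed on your cluster), and userecho stands in for "directory":

```shell
# Sketch only: create a sample project directory and write build.sbt
# with example versions (substitute your own scala_version/spark_version).
proj="$HOME/userecho"            # stands in for ~/"directory"
mkdir -p "$proj"
cat > "$proj/build.sbt" <<'EOF'
name := "UserEcho"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0"
EOF
cat "$proj/build.sbt"
```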
- Find the dependency files, aster-spark-extension-sparkspark_version.jar and spark-assembly-spark_version-hadoophadoop_version.jar, using the find command:
  - If Aster and Hadoop are on different clusters, find aster-spark-extension-sparkspark_version.jar on the Aster cluster and spark-assembly-spark_version-hadoophadoop_version.jar on the Hadoop/Spark cluster.
  - If Aster and Hadoop are on the same cluster, find both files on that cluster.

  If it is unclear which version of aster-spark-extension-sparkspark_version.jar to use:
  - On the Hadoop/Spark cluster, issue this command:

      hdfs dfs -ls /user/"sparkJobSubmitter"/"sparkJobSubmitter"

    The command returns a version number.
  - On the Aster cluster, find the aster-spark-extension-sparkspark_version.jar with the same version number.
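The find usage above can be sketched as follows. This demonstration runs the same name patterns against a scratch directory standing in for the cluster filesystem; the paths and version numbers (1.6.0, 2.7.1) are illustrative only:

```shell
# Demonstration of the find patterns for the two dependency jars.
# On a real cluster, run the same patterns against / (or the known
# Aster and Spark library directories) instead of $root.
root=$(mktemp -d)
mkdir -p "$root/opt/aster" "$root/usr/lib/spark/lib"
touch "$root/opt/aster/aster-spark-extension-spark1.6.0.jar"          # example version
touch "$root/usr/lib/spark/lib/spark-assembly-1.6.0-hadoop2.7.1.jar"  # example version
find "$root" -name 'aster-spark-extension-spark*.jar'
find "$root" -name 'spark-assembly-*hadoop*.jar'
```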
- Copy the dependency files to the directory ~/"directory"/lib.
- Copy UserEcho.scala to the directory ~/"directory"/src/main/scala/"packageName".
- Ensure that the directory structure is:

    directory
      build.sbt
      lib
        aster-spark-extension-sparkspark_version.jar
        spark-assembly-spark_version-hadoophadoop_version.jar
      src
        main
          scala
            packageName
              UserEcho.scala
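One quick way to verify the layout is to list the tree with find. This sketch recreates the expected structure under a scratch directory, keeping the placeholder names literal, and lists it; on your machine, run the find command against ~/"directory" itself:

```shell
# Sketch: build the expected layout in a scratch directory and list it.
# Replace $d with ~/"directory" to check the real project tree.
d=$(mktemp -d)/directory
mkdir -p "$d/lib" "$d/src/main/scala/packageName"
touch "$d/build.sbt" \
      "$d/lib/aster-spark-extension-sparkspark_version.jar" \
      "$d/lib/spark-assembly-spark_version-hadoophadoop_version.jar" \
      "$d/src/main/scala/packageName/UserEcho.scala"
find "$d" | sort
```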
- Build the function:

    % cd directory
    % sbt package

  The sbt command creates the file ~/"directory"/target/scala-2.10/userecho_2.10-1.0.jar.
- Copy the UserEcho function to the HDFS directory of sparkJobSubmitter:

    cp ~/"directory"/target/scala-2.10/userecho_2.10-1.0.jar /tmp
    su - hdfs
    hdfs dfs -put /tmp/userecho_2.10-1.0.jar /user/"sparkJobSubmitter"/"sparkJobSubmitter"
    hdfs dfs -chown "sparkJobSubmitter":"sparkJobSubmitter" /user/"sparkJobSubmitter"/"sparkJobSubmitter"/userecho_2.10-1.0.jar
- Run the UserEcho function:

    SELECT * FROM RunOnSpark (
      ON people
      SPARKCODE ('myfunctions.UserEcho')
      OUTPUTS ('name varchar(250)', 'age integer')
      SPARK_CLUSTER_ID ('Aster_Spark_*_site.json-socket-SSh')
      APP_RESOURCE ('hdfs:/user/"sparkJobSubmitter"/"sparkJobSubmitter"/userecho_2.10-1.0.jar')
      JARS ('hdfs:/user/"sparkJobSubmitter"/"sparkJobSubmitter"/aster-spark-extension-sparkspark_version.jar')
    ) sp;
  The RunOnSpark function requires the information provided by the APP_RESOURCE and JARS arguments; however, you can omit these arguments from the query if you provide them in the configuration file.