RunOnSpark Arguments - Aster Analytics

Teradata Aster® Spark Connector User Guide

Product: Aster Analytics
Release Number: 7.00.00.01
Published: May 2017
Language: English (United States)
Last Update: 2018-04-13
Product Category: Software
SPARKCODE
Specifies the name of a function and its arguments. The function must be in the jar file specified by the APP_RESOURCE argument.
Required: Yes
OUTPUTS
Specifies the names and data types of the output columns.
Required: No
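For example, a minimal RunOnSpark invocation sketch might combine SPARKCODE and OUTPUTS as follows. The table, class, arguments, and output columns are hypothetical, the ON-clause details (such as any PARTITION BY requirement) depend on your data and are not shown, and APP_RESOURCE is described later in this list.

  SELECT * FROM RunOnSpark (
      ON input_table                                        -- hypothetical input table
      SPARKCODE ('com.example.WordCount --minCount 2')      -- function (class) and its arguments; hypothetical
      OUTPUTS ('word varchar', 'frequency int')             -- output column names and data types; hypothetical
      APP_RESOURCE ('hdfs:///user/me/my-functions.jar')     -- jar file containing com.example.WordCount; hypothetical
  );
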
MEM_LIMIT_MB
Specifies the maximum number of megabytes to allocate for the data transfer buffers.
Required: No
Default: 32 MB
Configuration file key: (run-on-spark, mem-limit-mb)
TIMEOUT_SEC
Specifies the time, in seconds, after which the query times out (is canceled). If timeout_value is 0, timeout handling is disabled.
Required: No
Default: Set during installation to '0'.
Use the default value only for valid RunOnSpark queries. Disabling timeout handling can cause RunOnSpark queries with malformed parameters to wait forever.
Configuration file key: (run-on-spark, timeout-sec)
STATUS_INTERVAL_SEC
Specifies the time interval, in seconds, after which to check the Spark job status. If status_interval_value is 0, status checking is disabled.
Required: No
Default: Set during installation to '0'.
Use the default value only for valid RunOnSpark queries. Disabling status checking can cause RunOnSpark queries with malformed parameters to wait forever.
Configuration file key: (run-on-spark, status-interval-sec)
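As a sketch, both intervals can be overridden per query by adding clauses like the following to the RunOnSpark call shown after OUTPUTS; the values are illustrative only.

  TIMEOUT_SEC ('600')           -- cancel the query if the Spark job has not completed within 10 minutes
  STATUS_INTERVAL_SEC ('30')    -- check the Spark job status every 30 seconds
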
APP_RESOURCE
Specifies the location of the Teradata Aster Spark jar file on the Hadoop/Spark cluster. The aster_jar_location can be an HDFS location, a shared Linux drive, or a local Linux drive. If you wrote your own functions and put them into your own jar file, aster_jar_location must be the location of your jar file.
Required: Unless you provide this information in the configuration file.
Default: Set during installation to 'hdfs://user/sparkJobSubmitter/sparkJobSubmitter/aster-spark-extension-sparkversion.jar' (HDFS location of the Aster Spark jar).
Configuration file key: (spark-params, app-resource)
JARS
Specifies additional jar files. If you wrote your own functions that invoke the Teradata Aster Spark extension, specify the location of the Teradata-supplied aster-spark-extension-sparkversion.jar file.
Required: No
Default: None
Configuration file key: (spark-params, jars)
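A sketch of combining APP_RESOURCE and JARS when your own functions are in a separate jar file; both paths are hypothetical, and 'sparkversion' stands for the actual Spark version string.

  APP_RESOURCE ('hdfs:///user/me/my-spark-functions.jar')                           -- your own jar file (hypothetical path)
  JARS ('hdfs:///user/sparkJobSubmitter/aster-spark-extension-sparkversion.jar')    -- Teradata-supplied jar (hypothetical path)
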
EXTRA_SPARK_SUBMIT_OPTIONS
Specifies extra options to include when submitting the Spark job. The syntax of option_value_pair is:
option value
Required: No
Default: None
Configuration file key: (spark-params, extra-spark-submit-options)
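For example, the following sketch passes two standard spark-submit options as option value pairs; treating multiple pairs as a single space-separated string is an assumption.

  EXTRA_SPARK_SUBMIT_OPTIONS ('--driver-memory 2g --num-executors 4')
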
SPARK_CLUSTER_ID
Specifies the name of the Hadoop/Spark cluster entry in the configuration file to use to execute the Spark function that SPARKCODE specifies. A Spark instance can have multiple entries in the configuration file, each with different parameters. A query can have multiple RunOnSpark functions, which can reference the same or different Hadoop/Spark clusters.
Required: No
Default: Set during installation to a Spark-cluster-identifier created during installation; for example: 'AsterSpark_namenode_site.json-socket-noSSh'.
Do not use the default value without overriding any of its settings unless you know what the default Spark-cluster-identifier is and are sure you want all the installation-supplied settings associated with it. Teradata recommends that you start by explicitly specifying all the settings you want. When those work, you can put your desired settings into your own Spark-cluster-identifier, make it the default in the configuration file, and omit SPARK_CLUSTER_ID from your RunOnSpark queries.
Configuration file key: 'default' (the default cluster specified in the configuration file)

For information about the configuration file, refer to Configuration File: spark.config.
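A sketch of selecting a specific cluster entry by name, using the installation-generated identifier quoted above:

  SPARK_CLUSTER_ID ('AsterSpark_namenode_site.json-socket-noSSh')    -- cluster entry defined in spark.config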

DATA_TRANSFER
Specifies the method for transferring data to and from Spark:
  • 'file':

    Transfer the data to a distributed set of files (for example, on HDFS). The Spark application reads data from these files and writes output to a set of files.

  • 'socket-persist':

    Transfer the data directly to and from the Spark application through sockets and persist the sent data to files on the Spark side.

Required: No
Default: Set during installation to 'file' or 'socket-persist' for each Spark-cluster-identifier created during installation.
Configuration file key: (run-on-spark, data-transfer)
PERSIST_LOCATION
Specifies the HDFS location to use for results received from Spark.
Required: No
Default: Set during installation to an HDFS directory; for example: 'hdfs://user/sparkJobSubmitter/tmp'.
Configuration file key: (run-on-spark, persist-location)
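A sketch combining the transfer method with an explicit location for persisted data; the directory is hypothetical.

  DATA_TRANSFER ('socket-persist')                            -- transfer over sockets and persist the sent data on the Spark side
  PERSIST_LOCATION ('hdfs:///user/sparkJobSubmitter/tmp')     -- HDFS directory for results received from Spark (hypothetical)
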
SPARK_PROPERTIES
Specifies additional Spark properties to apply when running the Spark job. An example of spark_property_name is spark.executor.memory, which specifies the amount of memory to use for each executor process.
Required: No
Default: None
Configuration file key: (spark-properties, spark-property-name)
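A sketch setting the spark.executor.memory property mentioned above; the name=value formatting is an assumption.

  SPARK_PROPERTIES ('spark.executor.memory=4g')    -- 4 GB per executor process; value format is an assumption
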
REST_URL
Specifies the Spark REST URL that the RunOnSpark master instance uses to submit jobs, query their status, and cancel them if they run beyond their timeout limit.
Required: Unless you provide this information in the configuration file.
Default: Set during installation to 'http://namenode/ws/v1/cluster/apps/'.
Configuration file key: (spark-params, base-rest-url)
SSH_HOST
Specifies the name of the user with whose credentials the Spark job runs and the host where the Spark job starts. The user extensibility must have OpenSSH access to user@host to run the spark-submit and yarn commands.
Required: When USE_REMOTE_SSH is 'true'.
Default: Set during installation to 'sparkJobSubmitter@namenode'.
Configuration file key: (run-on-spark, ssh-host)
IDENTITY_FILE
Specifies the identity file (.pem) path to use with remote ssh when you do not want to enable passwordless ssh to the Hadoop/Spark cluster.
Required: When passwordless ssh is disabled and USE_REMOTE_SSH is 'true'.
Default: Set during installation to '/home/beehive/config/spark/namenode/IDENTITYFILE'.
Configuration file key: (run-on-spark, identity-file)
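A sketch of a remote-ssh setup that names the submitting user and host and supplies an identity file; the host is hypothetical and the identity-file path is the installation default quoted above. USE_REMOTE_SSH is described later in this list.

  USE_REMOTE_SSH ('true')                                              -- start the Spark job over remote ssh
  SSH_HOST ('sparkJobSubmitter@namenode.example.com')                  -- user@host for spark-submit and yarn (hypothetical host)
  IDENTITY_FILE ('/home/beehive/config/spark/namenode/IDENTITYFILE')   -- .pem identity file when passwordless ssh is disabled
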
SPARK_SUBMIT_COMMAND
Specifies the command that starts the Spark job.
Required: No
Default: Set during installation to '/usr/bin/spark-submit' for an ssh Spark-cluster-identifier or '/home/beehive/config/spark/namenode/client/bin/spark-submit' for a non-ssh Spark-cluster-identifier.
Configuration file key: (run-on-spark, spark-submit)
YARN_COMMAND
Specifies the yarn command to invoke.
Required: No
Default: Set during installation to '/usr/bin/yarn'.
Configuration file key: (run-on-spark, yarn-command)
HADOOP_JARS_LOCATIONS
Specifies paths to the Hadoop jar files, used when DATA_TRANSFER is 'file'.
Required: No
Default: Set during installation to '/home/beehive/config/spark/namenode/hadoopjars'.
Configuration file key: (run-on-spark, hadoop-jars-locations)
HADOOP_CONF_LOCATION
Specifies the local path to the Hadoop configuration file, used when DATA_TRANSFER is 'file'.
Required: When DATA_TRANSFER is 'file' or USE_REMOTE_SSH is 'false'.
Default: Set during installation to '/home/beehive/config/spark/namenode/hadoopConf'.
Configuration file key: (run-on-spark, hadoop-config-location)
SPARK_CONF_LOCATION
Specifies the local path to the Spark configuration file, used when USE_REMOTE_SSH is 'false'.
Required: When USE_REMOTE_SSH is 'false'.
Default: Set during installation to '/home/beehive/config/spark/namenode/sparkConf'.
Configuration file key: (run-on-spark, spark-config-location)
KERBEROS_AUTHENTICATION
Specifies whether to enable Kerberos authentication on the Hadoop system.
Required: No
Default: None

If you omit this argument, the Aster Database uses the Hadoop configuration files to determine whether Kerberos authentication is enabled on the Hadoop system.

Configuration file key: (kerberos-authentication)
SPARK_JOB_USER_KEY_TAB
Specifies the location of the keytab file for the user who runs the Spark job when Kerberos authentication is enabled on the Hadoop system.
Required: When Kerberos authentication is enabled on the Hadoop system.
Default: None
Configuration file key: (run-on-spark, spark-job-user-keytab)
FILE_ACCESS_KEY_TAB
Specifies the location of the keytab file used for accessing HDFS.
Required: When Kerberos authentication is enabled on the Hadoop system and the data transfer method is 'file'.
Default: None
Configuration file key: (file-access-keytab)
KINIT
Specifies the Kerberos kinit command when Kerberos authentication is enabled on the Hadoop system.
Required: When Kerberos authentication is enabled on the Hadoop system.
Default: Set during installation to '/usr/bin/kinit'.
Configuration file key: (run-on-spark, kinit-command)
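A sketch of the Kerberos-related clauses for a kerberized Hadoop system that uses the 'file' transfer method; the keytab paths are hypothetical, and passing 'true' to KERBEROS_AUTHENTICATION is an assumption about the value format.

  KERBEROS_AUTHENTICATION ('true')                                             -- value format is an assumption
  SPARK_JOB_USER_KEY_TAB ('/etc/security/keytabs/sparkJobSubmitter.keytab')    -- keytab of the Spark job user (hypothetical path)
  FILE_ACCESS_KEY_TAB ('/etc/security/keytabs/hdfs-access.keytab')             -- keytab for HDFS access when DATA_TRANSFER is 'file' (hypothetical path)
  KINIT ('/usr/bin/kinit')                                                     -- kinit command (installation default)
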
WORKERS_IP_ADDRESSES

Specifies the IP addresses of the vworkers. The ip_address_start is the string with which all vworker IP addresses start. You must specify this argument if the Aster nodes have multiple IP addresses that are neither public nor accessible from the Hadoop nodes.

Required: No
Default: None
Configuration file key: (workers-ip-addresses)
LOGGING_LEVEL
Specifies the amount of logging information generated (for Spark) and logged to the Teradata Aster SQL-MapReduce® logs (for the RunOnSpark function):
  • 'INFO': Log only information.
  • 'WARNING': Log information and warnings.
  • 'ERROR': Log information, warnings, and errors.
  • 'DEBUG': Log as much information as possible, to help with troubleshooting.
Required: No
Default: Set during installation to 'INFO'.
Configuration file key: (run-on-spark, logging-level)
DELIMITER
Specifies the character to use as a field separator when transferring data to and from Spark.
Required: No
Default: Set during installation to '\t'.
Configuration file key: (run-on-spark, delimiter)
NULL_STRING
Specifies the string to use to represent a null value.
Required: No
Default: Set during installation to 'NULL'.
Configuration file key: (run-on-spark, null-string)
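A sketch overriding the field separator and the null marker used when transferring data; both values are illustrative.

  DELIMITER ('|')       -- use a pipe instead of the default tab character
  NULL_STRING ('\N')    -- string that represents a null value in transferred rows (illustrative)
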
USE_REMOTE_SSH
Specifies whether remote ssh is required to start the Spark job on the remote Hadoop/Spark cluster.
Required: No
Default: Set during installation to 'true' for an ssh Spark-cluster-identifier or 'false' for a non-ssh Spark-cluster-identifier.
Configuration file key: (run-on-spark, use-remote-ssh)
SPARK_JOB_USER
Specifies the name of the user with whose credentials the Spark job runs.
Required: When USE_REMOTE_SSH is 'false'.
Default: Set during installation to 'sparkJobSubmitter'.
Configuration file key: (run-on-spark, spark-job-user)
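A sketch of a non-ssh setup that runs spark-submit locally under an explicit user; the configuration paths are the installation defaults quoted above.

  USE_REMOTE_SSH ('false')                                                  -- run spark-submit locally instead of over remote ssh
  SPARK_JOB_USER ('sparkJobSubmitter')                                      -- user whose credentials run the Spark job
  HADOOP_CONF_LOCATION ('/home/beehive/config/spark/namenode/hadoopConf')   -- required when USE_REMOTE_SSH is 'false'
  SPARK_CONF_LOCATION ('/home/beehive/config/spark/namenode/sparkConf')     -- required when USE_REMOTE_SSH is 'false'
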
MAILMANSERVER_BLOCK_TIMEOUT_SEC
Specifies the number of seconds after which the Mailman server block times out.
Required: No
Default: 3600
Configuration file key: (run-on-spark, mailmanserver-block-timeout-sec)