RunOnSpark Arguments - Aster Analytics

Teradata Aster® Spark Connector User Guide

Product: Aster Analytics
Release Number: 7.00.00.01
Published: May 2017
Language: English (United States)
Last Update: 2018-04-13
Product Category: Software
SPARKCODE
Specifies the name of a function and its arguments. The function must be in the jar file specified by the APP_RESOURCE argument.
Required: Yes
OUTPUTS
Specifies the names and data types of the output columns.
Required: No
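For example, a minimal RunOnSpark invocation sketch might combine SPARKCODE and OUTPUTS as follows. The table, class, arguments, and output columns are hypothetical, the ON-clause details (such as any PARTITION BY requirement) depend on your data and are not shown, and APP_RESOURCE is described later in this list.

  SELECT * FROM RunOnSpark (
      ON input_table                                        -- hypothetical input table
      SPARKCODE ('com.example.WordCount --minCount 2')      -- function (class) and its arguments; hypothetical
      OUTPUTS ('word varchar', 'frequency int')             -- output column names and data types; hypothetical
      APP_RESOURCE ('hdfs:///user/me/my-functions.jar')     -- jar file containing com.example.WordCount; hypothetical
  );
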
MEM_LIMIT_MB
Specifies the maximum number of megabytes to allocate for the data transfer buffers.
Required: No
Default: 32 MB
Configuration file key: (run-on-spark, mem-limit-mb)
TIMEOUT_SEC
Specifies the time, in seconds, after which the query times out (is canceled). If timeout_value is 0, timeout handling is disabled.
Required: No
Default: Set during installation to '0'.
Use the default value only for valid RunOnSpark queries. Disabling timeout handling can cause RunOnSpark queries with malformed parameters to wait forever.
Configuration file key: (run-on-spark, timeout-sec)
STATUS_INTERVAL_SEC
Specifies the time interval, in seconds, after which to check the Spark job status. If status_interval_value is 0, status checking is disabled.
Required: No
Default: Set during installation to '0'.
Use the default value only for valid RunOnSpark queries. Disabling status checking can cause RunOnSpark queries with malformed parameters to wait forever.
Configuration file key: (run-on-spark, status-interval-sec)
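As a sketch, both intervals can be overridden per query by adding clauses like the following to the RunOnSpark call shown after OUTPUTS; the values are illustrative only.

  TIMEOUT_SEC ('600')           -- cancel the query if the Spark job has not completed within 10 minutes
  STATUS_INTERVAL_SEC ('30')    -- check the Spark job status every 30 seconds
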
APP_RESOURCE
Specifies the location of the Teradata Aster Spark jar file on the Hadoop/Spark cluster. The aster_jar_location can be an HDFS location, a shared Linux drive, or a local Linux drive. If you wrote your own functions and put them into your own jar file, aster_jar_location must be the location of your jar file.
Required: Unless you provide this information in the configuration file.
Default: Set during installation to 'hdfs://user/sparkJobSubmitter/sparkJobSubmitter/aster-spark-extension-sparkversion.jar' (HDFS location of the Aster Spark jar).
Configuration file key: (spark-params, app-resource)
JARS
Specifies additional jar files. If you wrote your own functions that invoke the Teradata Aster Spark extension, specify the location of the Teradata-supplied aster-spark-extension-sparkversion.jar file.
Required: No
Default: None
Configuration file key: (spark-params, jars)
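A sketch of combining APP_RESOURCE and JARS when your own functions are in a separate jar file; both paths are hypothetical, and 'sparkversion' stands for the actual Spark version string.

  APP_RESOURCE ('hdfs:///user/me/my-spark-functions.jar')                           -- your own jar file (hypothetical path)
  JARS ('hdfs:///user/sparkJobSubmitter/aster-spark-extension-sparkversion.jar')    -- Teradata-supplied jar (hypothetical path)
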
EXTRA_SPARK_SUBMIT_OPTIONS
Specifies extra options to include when submitting the Spark job. The syntax of option_value_pair is:
option value
Required: No
Default: None
Configuration file key: (spark-params, extra-spark-submit-options)
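For example, the following sketch passes two standard spark-submit options as option value pairs; treating multiple pairs as a single space-separated string is an assumption.

  EXTRA_SPARK_SUBMIT_OPTIONS ('--driver-memory 2g --num-executors 4')
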
SPARK_CLUSTER_ID
Specifies the name of the Hadoop/Spark cluster entry in the configuration file to use to execute the Spark function that SPARKCODE specifies. A Spark instance can have multiple entries in the configuration file, each with different parameters. A query can have multiple RunOnSpark functions, which can reference the same or different Hadoop/Spark clusters.
Required: No
Default: Set during installation to a Spark-cluster-identifier created during installation; for example: 'AsterSpark_namenode_site.json-socket-noSSh'.
Do not use the default value without overriding any of its settings unless you know what the default Spark-cluster-identifier is and are sure you want all the installation-supplied settings associated with it. Teradata recommends that you start by explicitly specifying all the settings you want. When those work, you can put your desired settings into your own Spark-cluster-identifier, make it the default in the configuration file, and omit SPARK_CLUSTER_ID from your RunOnSpark queries.
Configuration file key: 'default' (the default cluster specified in the configuration file)

For information about the configuration file, refer to Configuration File: spark.config.
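A sketch of selecting a specific cluster entry by name, using the installation-generated identifier quoted above:

  SPARK_CLUSTER_ID ('AsterSpark_namenode_site.json-socket-noSSh')    -- cluster entry defined in spark.config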

DATA_TRANSFER
Specifies the method for transferring data to and from Spark:
  • 'file':

    Transfer the data to a distributed set of files (for example, on HDFS). The Spark application reads data from these files and writes output to a set of files.

  • 'socket-persist':

    Transfer the data directly to and from the Spark application through sockets and persist the sent data to files on the Spark side.

Required: No
Default: Set during installation to 'file' or 'socket-persist' for each Spark-cluster-identifier created during installation.
Configuration file key: (run-on-spark, data-transfer)
PERSIST_LOCATION
Specifies the HDFS location to use for results received from Spark.
Required: No
Default: Set during installation to an HDFS directory; for example: 'hdfs://user/sparkJobSubmitter/tmp'.
Configuration file key: (run-on-spark, persist-location)
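A sketch combining the transfer method with an explicit location for persisted data; the directory is hypothetical.

  DATA_TRANSFER ('socket-persist')                            -- transfer over sockets and persist the sent data on the Spark side
  PERSIST_LOCATION ('hdfs:///user/sparkJobSubmitter/tmp')     -- HDFS directory for results received from Spark (hypothetical)
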
SPARK_PROPERTIES
Specifies additional Spark properties to apply when running the Spark job. An example of spark_property_name is spark.executor.memory, which specifies the amount of memory to use for each executor process.
Required: No
Default: None
Configuration file key: (spark-properties, spark-property-name)
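A sketch setting the spark.executor.memory property mentioned above; the name=value formatting is an assumption.

  SPARK_PROPERTIES ('spark.executor.memory=4g')    -- 4 GB per executor process; value format is an assumption
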
REST_URL
Specifies the Spark REST URL that the RunOnSpark master instance uses to submit jobs, query their status, and cancel them if they run beyond their timeout limit.
Required: Unless you provide this information in the configuration file.
Default: Set during installation to 'http://namenode/ws/v1/cluster/apps/'.
Configuration file key: (spark-params, base-rest-url)
SSH_HOST
Specifies the name of the user with whose credentials the Spark job runs and the host where the Spark job starts. The user extensibility must have OpenSSH access to user@host to run the spark-submit and yarn commands.
Required: When USE_REMOTE_SSH is 'true'.
Default: Set during installation to 'sparkJobSubmitter@namenode'.
Configuration file key: (run-on-spark, ssh-host)
IDENTITY_FILE
Specifies the identity file (.pem) path to use with remote ssh when you do not want to enable passwordless ssh to the Hadoop/Spark cluster.
Required: When passwordless ssh is disabled and USE_REMOTE_SSH is 'true'.
Default: Set during installation to '/home/beehive/config/spark/namenode/IDENTITYFILE'.
Configuration file key: (run-on-spark, identity-file)
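A sketch of a remote-ssh setup that names the submitting user and host and supplies an identity file; the host is hypothetical and the identity-file path is the installation default quoted above. USE_REMOTE_SSH is described later in this list.

  USE_REMOTE_SSH ('true')                                              -- start the Spark job over remote ssh
  SSH_HOST ('sparkJobSubmitter@namenode.example.com')                  -- user@host for spark-submit and yarn (hypothetical host)
  IDENTITY_FILE ('/home/beehive/config/spark/namenode/IDENTITYFILE')   -- .pem identity file when passwordless ssh is disabled
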
SPARK_SUBMIT_COMMAND
Specifies the command that starts the Spark job.
Required: No
Default: Set during installation to '/usr/bin/spark-submit' for an ssh Spark-cluster-identifier or '/home/beehive/config/spark/namenode/client/bin/spark-submit' for a non-ssh Spark-cluster-identifier.
Configuration file key: (run-on-spark, spark-submit)
YARN_COMMAND
Specifies the yarn command to invoke.
Required: No
Default: Set during installation to '/usr/bin/yarn'.
Configuration file key: (run-on-spark, yarn-command)
HADOOP_JARS_LOCATIONS
Specifies paths to the Hadoop jar files, used when DATA_TRANSFER is 'file'.
Required: No
Default: Set during installation to '/home/beehive/config/spark/namenode/hadoopjars'.
Configuration file key: (run-on-spark, hadoop-jars-locations)
HADOOP_CONF_LOCATION
Specifies the local path to the Hadoop configuration file, used when DATA_TRANSFER is 'file'.
Required: When DATA_TRANSFER is 'file' or USE_REMOTE_SSH is 'false'.
Default: Set during installation to '/home/beehive/config/spark/namenode/hadoopConf'.
Configuration file key: (run-on-spark, hadoop-config-location)
SPARK_CONF_LOCATION
Specifies the local path to the Spark configuration file, used when USE_REMOTE_SSH is 'false'.
Required: When USE_REMOTE_SSH is 'false'.
Default: Set during installation to '/home/beehive/config/spark/namenode/sparkConf'.
Configuration file key: (run-on-spark, spark-config-location)
KERBEROS_AUTHENTICATION
Specifies whether to enable Kerberos authentication on the Hadoop system.
Required: No
Default: None

If you omit this argument, the Aster Database uses the Hadoop configuration files to determine whether Kerberos authentication is enabled on the Hadoop system.

Configuration file key: (kerberos-authentication)
SPARK_JOB_USER_KEY_TAB
Specifies the location of the keytab file for the user who runs the Spark job when Kerberos authentication is enabled on the Hadoop system.
Required: When Kerberos authentication is enabled on the Hadoop system.
Default: None
Configuration file key: (run-on-spark, spark-job-user-keytab)
FILE_ACCESS_KEY_TAB
Specifies the location of the keytab file used for accessing HDFS.
Required: When Kerberos authentication is enabled on the Hadoop system and the data transfer method is 'file'.
Default: None
Configuration file key: (file-access-keytab)
KINIT
Specifies the Kerberos kinit command when Kerberos authentication is enabled on the Hadoop system.
Required: When Kerberos authentication is enabled on the Hadoop system.
Default: Set during installation to '/usr/bin/kinit'.
Configuration file key: (run-on-spark, kinit-command)
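A sketch of the Kerberos-related clauses for a kerberized Hadoop system that uses the 'file' transfer method; the keytab paths are hypothetical, and passing 'true' to KERBEROS_AUTHENTICATION is an assumption about the value format.

  KERBEROS_AUTHENTICATION ('true')                                             -- value format is an assumption
  SPARK_JOB_USER_KEY_TAB ('/etc/security/keytabs/sparkJobSubmitter.keytab')    -- keytab of the Spark job user (hypothetical path)
  FILE_ACCESS_KEY_TAB ('/etc/security/keytabs/hdfs-access.keytab')             -- keytab for HDFS access when DATA_TRANSFER is 'file' (hypothetical path)
  KINIT ('/usr/bin/kinit')                                                     -- kinit command (installation default)
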
WORKERS_IP_ADDRESSES

Specifies the IP addresses of the vworkers. The ip_address_start is the string with which all vworker IP addresses start. You must specify this argument if the Aster nodes have multiple IP addresses that are neither public nor accessible from the Hadoop nodes.

Required: No
Default: None
Configuration file key: (workers-ip-addresses)
LOGGING_LEVEL
Specifies the amount of logging information generated (for Spark) and logged to the Teradata Aster SQL-MapReduce® logs (for the RunOnSpark function):
  • 'INFO': Log only information.
  • 'WARNING': Log information and warnings.
  • 'ERROR': Log information, warnings, and errors.
  • 'DEBUG': Log as much information as possible, to help with troubleshooting.
Required: No
Default: Set during installation to 'INFO'.
Configuration file key: (run-on-spark, logging-level)
DELIMITER
Specifies the character to use as a field separator when transferring data to and from Spark.
Required: No
Default: Set during installation to '\t'.
Configuration file key: (run-on-spark, delimiter)
NULL_STRING
Specifies the string to use to represent a null value.
Required: No
Default: Set during installation to 'NULL'.
Configuration file key: (run-on-spark, null-string)
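A sketch overriding the field separator and the null marker used when transferring data; both values are illustrative.

  DELIMITER ('|')       -- use a pipe instead of the default tab character
  NULL_STRING ('\N')    -- string that represents a null value in transferred rows (illustrative)
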
USE_REMOTE_SSH
Specifies whether remote ssh is required to start the Spark job on the remote Hadoop/Spark cluster.
Required: No
Default: Set during installation to 'true' for an ssh Spark-cluster-identifier or 'false' for a non-ssh Spark-cluster-identifier.
Configuration file key: (run-on-spark, use-remote-ssh)
SPARK_JOB_USER
Specifies the name of the user with whose credentials the Spark job runs.
Required: When USE_REMOTE_SSH is 'false'.
Default: Set during installation to 'sparkJobSubmitter'.
Configuration file key: (run-on-spark, spark-job-user)
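A sketch of a non-ssh setup that runs spark-submit locally under an explicit user; the configuration paths are the installation defaults quoted above.

  USE_REMOTE_SSH ('false')                                                  -- run spark-submit locally instead of over remote ssh
  SPARK_JOB_USER ('sparkJobSubmitter')                                      -- user whose credentials run the Spark job
  HADOOP_CONF_LOCATION ('/home/beehive/config/spark/namenode/hadoopConf')   -- required when USE_REMOTE_SSH is 'false'
  SPARK_CONF_LOCATION ('/home/beehive/config/spark/namenode/sparkConf')     -- required when USE_REMOTE_SSH is 'false'
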
MAILMANSERVER_BLOCK_TIMEOUT_SEC
Specifies the number of seconds after which the Mailman server block times out.
Required: No
Default: 3600
Configuration file key: (run-on-spark, mailmanserver-block-timeout-sec)