Preparing for Installation - Aster Analytics

Teradata Aster® Spark Connector User Guide

Product: Aster Analytics
Release Number: 7.00.00.01
Published: May 2017
Language: English (United States)
Last Update: 2018-04-13
Product Category: Software
  1. Ensure that these are running:
    • Aster Database version AD 6.20 or later, using /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.7.3/bin/python
    • Hadoop/Spark cluster version HDP 2.4.2 or CDH 5.5.2
    • Spark version 1.6.1 (for HDP 2.4.2) or 1.5 (for CDH 5.5.2)
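    To confirm these versions from the command line, you can use commands such as the following (an illustrative sketch; exact invocations and output formats vary by distribution):

      # On the Aster queen, check the bundled Python version:
      /home/beehive/toolchain/x86_64-unknown-linux-gnu/python-2.7.3/bin/python --version
      # On a Hadoop/Spark node, check the Hadoop and Spark versions:
      hadoop version
      spark-submit --version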
  2. Ensure that these Hadoop YARN container settings have these values:
    Setting                                                             Value
    Memory allocated for all YARN containers on a node                  Maximum available value
    Minimum container size (memory)                                     512 MB
    Maximum container size (memory)                                     2048 MB
    Percentage of physical CPU allocated for all containers on a node   Maximum available value
    Number of virtual cores                                             Maximum available value
    Maximum container size (virtual cores)                              Maximum available value
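    These settings correspond to standard YARN properties in yarn-site.xml. As a sketch (the property names below are standard YARN keys, but confirm the exact keys and values through your cluster manager, such as Ambari or Cloudera Manager, rather than editing files directly):

      # Memory for all containers on a node:  yarn.nodemanager.resource.memory-mb
      # Min/max container size (memory):      yarn.scheduler.minimum-allocation-mb (512)
      #                                       yarn.scheduler.maximum-allocation-mb (2048)
      # CPU percentage for all containers:    yarn.nodemanager.resource.percentage-physical-cpu-limit
      # Virtual cores (total and per-container maximum):
      #                                       yarn.nodemanager.resource.cpu-vcores
      #                                       yarn.scheduler.maximum-allocation-vcores
      grep -A 1 'yarn.scheduler' /etc/hadoop/conf/yarn-site.xml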
  3. Ensure that the Hadoop nodes have enough free disk space. To make Hadoop clean up cache and log files before it runs out of disk space, configure these settings to values appropriate to your cluster:
    yarn.nodemanager.localizer.cache.cleanup.interval-ms
    yarn.nodemanager.localizer.cache.target-size-mb
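    For example, to have the NodeManager check its local cache every ten minutes and keep it under 10 GB (these match the commonly shipped YARN defaults; tune them for your cluster), set in yarn-site.xml:

      yarn.nodemanager.localizer.cache.cleanup.interval-ms = 600000
      yarn.nodemanager.localizer.cache.target-size-mb = 10240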
  4. Verify that Aster Database and Hadoop have network connectivity:
    1. Verify that the Aster Database queen and vworker nodes can resolve the host names of the Hadoop/Spark nodes:

      From the queen and each vworker, at the command prompt, enter:

      ping -c 3 -w 10 hadoop_node_host_name

      If you see the following result, refer to Unknown Hadoop Spark host name.

      ping: unknown host hadoop_node_host_name
    2. Verify that the setuid bit of /bin/ping is set (so that users can run ping).
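      For example (an illustrative check; output details vary by distribution):

      # An 's' in the user-execute position means the setuid bit is set:
      ls -l /bin/ping
      # -rwsr-xr-x 1 root root ... /bin/ping
      # If it is not set, as root:
      chmod u+s /bin/ping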
  5. Create a cluster-wide user on the Hadoop/Spark cluster (hereafter called sparkJobSubmitter) and authorize this user to submit Spark jobs.
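    For example, on a cluster without centralized user management, you might create the user on each Hadoop/Spark node (a sketch; sites that manage users through LDAP or Active Directory create the user there instead):

      # as root, on each Hadoop/Spark node:
      useradd -m sparkJobSubmitter
      passwd sparkJobSubmitter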
  6. If Kerberos authentication is enabled on the Hadoop system:
    1. Create a Kerberos principal and keytab credentials for sparkJobSubmitter. For example:
      # as root on Hadoop master node, create sparkJobSubmitter principal:
      kadmin.local
      addprinc sparkJobSubmitter/hdp101m1.labs.teradata.com@HDP101.HADOOP.TERADATA.COM
      # supply sparkJobSubmitter's password
      # create sparkJobSubmitter's keytab
      ktadd -k /home/sparkJobSubmitter/sparkJobSubmitter.keytab sparkJobSubmitter/hdp101m1.labs.teradata.com@HDP101.HADOOP.TERADATA.COM
      quit
      # give sparkJobSubmitter ownership of its keytab
      chown -R sparkJobSubmitter /home/sparkJobSubmitter/sparkJobSubmitter.keytab
    2. Copy /etc/krb5.conf to the Aster cluster. For example:
      scp /etc/krb5.conf root@Aster_queen:/etc/krb5.conf.new
      # As root on the Aster queen node, clone this kerberos conf file
      cp /etc/krb5.conf /etc/krb5.conf.old
      cp /etc/krb5.conf.new /etc/krb5.conf
      ncli node clonefile /etc/krb5.conf
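      You can verify the credentials by obtaining a ticket with the keytab (adjust the principal and keytab path to match what you created above):

      # as sparkJobSubmitter, on a Hadoop node:
      kinit -kt /home/sparkJobSubmitter/sparkJobSubmitter.keytab sparkJobSubmitter/hdp101m1.labs.teradata.com@HDP101.HADOOP.TERADATA.COM
      klist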
  7. If sparkJobSubmitter will use ssh to submit Spark jobs, enable either passwordless ssh or identity-file-based ssh from the Teradata Aster vworker nodes to the Hadoop/Spark cluster.
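    For example, to set up identity-file-based ssh from a vworker to a Hadoop/Spark node (a sketch; the key path and target host name are placeholders):

      # as the submitting user on each vworker:
      ssh-keygen -t rsa -f ~/.ssh/id_rsa -N ''
      ssh-copy-id -i ~/.ssh/id_rsa.pub sparkJobSubmitter@hadoop_node_host_name
      # verify that login works without a password prompt:
      ssh -i ~/.ssh/id_rsa sparkJobSubmitter@hadoop_node_host_name hostname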
  8. Create a Hadoop distributed file system (HDFS) directory for this user (for example, /user/sparkJobSubmitter). One of the system-generated configuration scripts assumes that this user can create HDFS directories under this directory and copy files to them. For example, the user must be able to execute commands such as:
    hadoop fs -mkdir -p /user/sparkJobSubmitter
    hadoop fs -chown -R sparkJobSubmitter /user/sparkJobSubmitter
  9. Create a user named beehive on the Hadoop/Spark cluster.
  10. Grant beehive read access to the Hadoop Spark assembly jar and topology_mappings.data files and write access to the /tmp directory (to which the installation script copies the aster-spark-extension*.jar file). The locations of the Hadoop Spark assembly jar and topology_mappings.data files can be system-specific. The configureAsterSpark script expects to access these files at these locations:
    Hadoop Distribution   Spark Assembly jar Location                  Topology Mappings Location
    HDP                   /usr/hdp/version/spark/lib/                  /etc/hadoop/conf/
    CDH                   /var/.../spark/lib/ or /opt/.../spark/lib/   /etc/hadoop/conf/
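    As a sketch of the required permissions on an HDP system (the jar file name and the version segment of the path vary by release; substitute the actual locations from the table above):

      # as root on each Hadoop node, let beehive (and others) read the files:
      chmod o+r /usr/hdp/version/spark/lib/spark-assembly*.jar
      chmod o+r /etc/hadoop/conf/topology_mappings.data
      # /tmp is normally world-writable with the sticky bit; confirm:
      ls -ld /tmp    # expect permissions drwxrwxrwt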
  11. If Aster and Spark are on different clusters, ensure that the Aster Database queen and vworker nodes can resolve the host names of your Hadoop nodes. On most platforms, you do this with the Domain Name System (DNS). On other platforms, one way to do this is:
    1. On the Aster Database queen node, edit /etc/hosts, adding the IP addresses and host names of your Hadoop nodes and your Aster queen and vworker nodes.
    2. Copy the edited file to your Aster vworker nodes, using this command:
      ncli node clonefile /etc/hosts
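      For example, the added /etc/hosts entries might look like this (the IP addresses and host names are placeholders for illustration):

      # Hadoop/Spark nodes
      10.0.0.11   hadoop_node1.example.com   hadoop_node1
      10.0.0.12   hadoop_node2.example.com   hadoop_node2
      # Aster queen and vworker nodes
      10.0.1.10   aster_queen.example.com    aster_queen
      10.0.1.11   aster_vworker1.example.com aster_vworker1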
  12. On the Aster Database queen and vworker nodes, grant beehive the privilege to execute this command:
    /bin/chown extensibility\:extensibility /home/beehive/config/spark/*/IDENTITYFILE
    The procedure for granting this privilege depends on your platform and environment. One example is:
    1. On the Aster queen, as user root, enter:
      visudo -f /etc/sudoers
    2. Add this line:
      beehive ALL= NOPASSWD: /bin/chown extensibility\:extensibility /home/beehive/config/spark/*/IDENTITYFILE

      This line lets the user beehive execute the chown command from any terminal on the queen node without specifying a password.

    3. Search for the 'Defaults requiretty' line and comment it out:
      # Defaults    requiretty
    4. Save /etc/sudoers and exit visudo.
    5. Clone the modified /etc/sudoers file to the vworkers, so that on each vworker the user beehive can transfer ownership of the id_rsa identity file to the user extensibility. Use this command:
      ncli node clonefile /etc/sudoers
      This ability is important because:
      • To access Spark, the Aster Database uses RunOnSpark queries, which use vworkers to submit Spark jobs. The vworkers submit jobs with the user ID of the user created in steps 5 and 6.
      • The vworkers can submit Spark jobs with OpenSSH. On the vworker, the user extensibility executes RunOnSpark tasks, which can include submitting Spark jobs. For security, ownership of and access to any identity file must be transferred to, and limited to, the user extensibility.

      The user beehive needs the ability to execute the chown command only during Aster Spark Connector configuration. After configuration, you can revoke this privilege: log on to the queen node as root, enter visudo -f /etc/sudoers, delete the line that you added in substep 2, restore the 'Defaults requiretty' line that you commented out in substep 3, save the file, exit visudo, and clone the modified file to the vworkers.
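      As a sketch, the revocation sequence is:

      # as root on the queen:
      visudo -f /etc/sudoers
      #   delete:    beehive ALL= NOPASSWD: /bin/chown extensibility\:extensibility /home/beehive/config/spark/*/IDENTITYFILE
      #   uncomment: Defaults    requiretty
      # save and exit, then push the change to the vworkers:
      ncli node clonefile /etc/sudoers
      # confirm that beehive no longer has the privilege:
      sudo -l -U beehive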