Amazon EMR - From Anaconda To Zeppelin
Motivation
Amazon EMR is described here as follows:
Amazon EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. You can also run these other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in Amazon EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.
Amazon EMR securely and reliably handles a broad set of big data use cases, including log analysis, web indexing, data transformations (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
In other words, if you use common big data Apache tools, you should seriously consider Amazon EMR because it makes the configuration process as painless as it can be. That’s not to say it is always easy, though.
Here is the list of big data Apache tools currently supported in EMR:
Flink | Ganglia | Hadoop | HBase |
HCatalog | Hive | Hue | Mahout |
Oozie | Pig | Phoenix | Presto |
Spark | Sqoop | Tez | Zeppelin |
ZooKeeper |
While Amazon has excellent documentation for basic setup and while there are several great tutorials online that cover a few aspects found in this tutorial, I had no luck finding a straightforward, sequential tutorial that allowed me to do all the things I wanted to do. In fact, many steps in this tutorial were discovered by yours truly after much trial and error. I am providing this partially as a reference for myself and partially in the hopes that my work will save you countless hours and moments of downright frustration.
Here is what will be covered:
1. Create S3 Bucket
2. Create A Key Pair
3. Create A Security Group
4. Add Bootstrap Script To S3
5. Create EMR Cluster w/Anaconda, Tensorflow, Theano, & Keras
6. Setup FoxyProxy For Zeppelin
7. Setup Zeppelin Notebook
8. Set Anaconda As Default Python Interpreter In Zeppelin
9. Setup Shiro Authentication in Zeppelin
10. Setup Zepl (formerly ZeppelinHub)
Assumptions
- You have already set up an AWS account and your region is set appropriately.
- Items in `here` are buttons you click, data you input into fields, or code that you type.
- Items in bold are names.
- I include the $ when I’m using the Terminal. Do not actually type the dollar sign, only the code that comes after.
Now on to the tutorial. Please follow the steps sequentially.
Step 1: Create S3 Bucket
- Sign in to the AWS Management Console and open the Amazon S3 console.
- Click `Create bucket`. A new window will open.
- Provide a name for your bucket under Bucket name. The name has to be unique and has to follow AWS guidelines. My bucket name for this demo is `standard-deviations-demo-bucket`.
- Press the `Next` button located at the bottom right.
- For this demo, we will assume the default values for properties and permissions are just fine, so click the `Next` button two more times. This will take you to Review.
- Press `Create bucket`.
- Congratulations, you have created an S3 bucket!
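The "AWS guidelines" for bucket names can be checked locally before you type a name into the console. The following is a minimal sketch, assuming only the core naming rules for general-purpose buckets (3 to 63 characters; lowercase letters, digits, dots, and hyphens; starts and ends with a letter or digit; no adjacent dots; not shaped like an IP address). The function name `bucket_name_problems` is my own, and the check cannot tell you whether the name is globally unique.

```python
import re

# Core S3 bucket naming rules only; this is not the full AWS rule set.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")
_IP_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def bucket_name_problems(name):
    """Return a list of rule violations; an empty list means the name looks valid."""
    problems = []
    if not 3 <= len(name) <= 63:
        problems.append("length must be between 3 and 63 characters")
    if not _BUCKET_RE.match(name):
        problems.append(
            "use only lowercase letters, digits, dots, and hyphens, "
            "starting and ending with a letter or digit"
        )
    if ".." in name:
        problems.append("dots must not be adjacent")
    if _IP_RE.match(name):
        problems.append("name must not be formatted like an IP address")
    return problems


if __name__ == "__main__":
    for candidate in ("standard-deviations-demo-bucket", "My_Bucket"):
        print(candidate, "->", bucket_name_problems(candidate) or "looks OK")
```

A name that passes here can still be rejected by the console if someone else already owns it.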
Step 2: Create A Key Pair
- Open the Amazon EC2 console.
- On the left-hand side there is a list that starts with EC2 Dashboard, Events, Tags, Reports, and so on. Look for the group titled NETWORK & SECURITY. Click the 4th option called Key Pairs.
- Click `Create Key Pair`.
- Enter a key pair name. I will use standard-deviations-demo-key-pair for this demo.
- Click `Create`.
- Your private key file will automatically download. The base filename is the name you specified as the name of your key pair, and the filename extension is .pem.
- Save the private key file in a safe place. In practice, I move it to my .ssh directory. But to make this demo easier later on, I will move it to my home directory. Keep track of where you store your key. Use Finder to transfer the key or open Terminal and type `$ mv ~/Downloads/standard-deviations-demo-key-pair.pem ~`.
- Still in Terminal, navigate to where your key is located. Again, I stored my key in my home directory so there is no need for me to change directories at this point. You will have to if you stored your key somewhere besides the home directory.
- Use the following command to set the permissions of your private key file so only you can read it: `$ chmod 400 standard-deviations-demo-key-pair.pem`. You can check permissions with `ls -l`.
- Tada! You now have a key pair set up so you can SSH into your EC2 nodes later on.
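If you would rather check the key's permissions from a script than read `ls -l` output, here is a small sketch using only the Python standard library. The helper name `is_private_to_owner` is hypothetical, and the path assumes the key sits in your home directory as described above.

```python
import os
import stat


def is_private_to_owner(path):
    """True if group and others have no permissions on the file (e.g. mode 400 or 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


if __name__ == "__main__":
    # Assumed location: the key was moved to the home directory in this step.
    key = os.path.expanduser("~/standard-deviations-demo-key-pair.pem")
    if not os.path.exists(key):
        print("key not found at", key)
    elif is_private_to_owner(key):
        print(key, "is readable by its owner only")
    else:
        print(key, "is too open; run chmod 400 on it")
```

SSH refuses to use a private key that other users can read, which is why the `chmod 400` step matters.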
Step 3: Create A Security Group
- Open the Amazon EC2 console.
- On the left-hand side, look for the group titled NETWORK & SECURITY. Click the 1st option called Security Groups.
- Click the blue `Create Security Group` button.
- Set Security group name to `cluster_security_group`.
- Set description to `keep the bad guys out`.
- The Inbound tab should already be selected. If not, select it now.
- Click `Add Rule`.
- Select `SSH` from the dropdown.
- Under Source there is a dropdown box that says Custom. Open the dropdown and select `My IP`. This will automatically populate your IP address so only you will have access to your cluster.
- Click the blue `Create` button on bottom right.
- That’s it. All done with security group setup!
Step 4: Add Bootstrap Script To S3
- Copy or download my script called emr_configs.sh.
Note: You may notice that we are downloading Anaconda3-4.2.0, which is not the most current version. That is by design. Version 4.3.0 upgraded to Python 3.6, which breaks PySpark in the Spark 2.1.0 release we install in Step 5.
- Upload to the S3 bucket we created in Step 1 called standard-deviations-demo-bucket.
Step 5: Create EMR Cluster w/Anaconda, Tensorflow, Theano, & Keras
- Sign in to the AWS Management Console and open the Amazon EMR console.
- Click `Create cluster`.
- Click `Go to advanced options` at top.
- We will use the latest EMR version, which is 5.5.0 at the time of this writing. Select the software you want to install. For demo purposes, I will select Hadoop 2.7.3, Spark 2.1.0, and Zeppelin 0.7.1. Leave everything else as is.
- Click the blue `Next` button at bottom right.
- Set the number of Core instances. I am using 1 so we have 1 Master and 1 Worker. You can change this after the cluster is created so don’t worry if you change your mind later.
- Click the blue `Next` button at bottom right.
- Input a name in the Cluster name field. I will use Demo Cluster.
- Click the folder icon next to S3 folder. Select standard-deviations-demo-bucket.
- Click the blue `Select` button.
- Expand Bootstrap Actions at bottom. Open the dropdown called Add bootstrap action. Select `Custom action`. Click the grey `Configure and add` button.
- A new window opens. In the Name field I will use emr bootstrap. Select the folder to the right of Script location and update with `emr_configs.sh`. Click the blue `Add` button. The window will close.
- Click the blue `Next` button at bottom right.
- Click the blue `Next` button at bottom right.
- In EC2 key pair, open the dropdown and select `standard-deviations-demo-key-pair`.
- Expand EC2 Security Groups at middle bottom of page.
- For Master, use the dropdown to select the option ending in `(cluster_security_group)`.
- For Core & Task, use the dropdown to select the option ending in `(cluster_security_group)`.
- Click the blue `Create cluster` button at bottom right.
- A dashboard opens. It takes 10+ minutes for your cluster to do its thing so be patient. Your cluster is ready when your status reads Waiting in green.
- Once your cluster is Waiting, locate Master public DNS on your dashboard. Click on the blue text that says `SSH` to the far right of that line.
- A new window opens. In this window, copy the command in the grey box from step 2.
- Open Terminal.
- Assuming your key is located in your home directory, paste this command as is and hit enter.
Note: if you moved your key, you will have to update the path to where your .pem file is located.
- You will get a message saying “The authenticity of host ‘long host name’ can’t be established. Are you sure you want to continue connecting?” This is standard. Type `yes`.
- You are successful if you see EMR spelled out in very large letters.
- All done. Nothing more to see here.
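For reference, the console choices in this step map roughly onto a single `aws emr create-cluster` command. The sketch below only assembles and prints that command; it does not run it. Several things are assumptions on my part: the instance type is a placeholder because the console default is not named in this tutorial, the security group selection is left out, and `--use-default-roles` assumes the default EMR roles already exist in your account. The function name `create_cluster_argv` is hypothetical.

```python
import shlex


def create_cluster_argv(bucket, key_pair, instance_type, name="Demo Cluster"):
    """Assemble (but do not run) an `aws emr create-cluster` command."""
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", "emr-5.5.0",
        "--applications", "Name=Hadoop", "Name=Spark", "Name=Zeppelin",
        "--instance-type", instance_type,
        "--instance-count", "2",  # 1 master + 1 core node, as in this step
        "--ec2-attributes", "KeyName=" + key_pair,
        "--bootstrap-actions", "Path=s3://" + bucket + "/emr_configs.sh",
        "--log-uri", "s3://" + bucket + "/",
        "--use-default-roles",
    ]


argv = create_cluster_argv(
    bucket="standard-deviations-demo-bucket",
    key_pair="standard-deviations-demo-key-pair",
    instance_type="m3.xlarge",  # placeholder; the tutorial does not name an instance type
)
print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the command first lets you compare it against what you clicked in the console before deciding whether to script cluster creation.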
Step 6: Setup FoxyProxy For Zeppelin
- In Chrome, add the FoxyProxy Standard extension.
- Restart Chrome after installing FoxyProxy.
- Copy or download my script called foxyproxy-settings.xml.
- Upload to the S3 bucket we created in Step 1: Create S3 Bucket.
- Click on the `FoxyProxy icon` in the toolbar and select `Options`.
- Click `Import/Export`.
- Click `Choose File`, select `foxyproxy-settings.xml`, and click `Open`.
- In the Import FoxyProxy Settings dialog, click `Add`.
. - FoxyProxy setup complete!
Step 7: Setup Zeppelin Notebook
- Navigate to the Amazon EMR console.
- Locate Master public DNS on your dashboard.
- Click on the blue text that says `SSH` to the far right of that line.
- A new window opens. In this window, copy the command in the grey box from step 2.
- Open Terminal.
- Assuming your key is located in your home directory, paste this command as is and hit enter.
Note: if you moved your key, you will have to update the path to where your .pem file is located.
- Run the following commands in sequence:
$ cd /usr/lib/zeppelin
$ sudo bash bin/install-interpreter.sh -a
$ sudo bash bin/zeppelin-daemon.sh start
- Go back to the Amazon EMR dashboard and select `Enable Web Connection`.
- A new window pops up. Copy the command from Step 1: Open an SSH Tunnel to the Amazon EMR Master Node.
- Open a new Terminal window and paste the command you just copied.
NOTE 1: You may have to update the path to your key. I did since I stored my key in .ssh.
NOTE 2: This command opens a port. It will look like the command never finishes. That is normal. Do not close or exit.
- Open Chrome.
- Click the `FoxyProxy icon` at the top right and choose `Use proxies based on their pre-defined patterns and priorities`.
. - Go back to the Amazon EMR dashboard.
- In the same spot you clicked Enable Web Connection, the word Zeppelin should appear in blue text. Click it.
- This will open a new tab in Chrome. If all was configured properly, Zeppelin notebook should fire up.
- Congratulations, you are done with this section!
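Because the tunnel command never returns, it can be hard to tell whether it is actually working. One way to check, sketched below, is to test whether anything is listening on the local forwarding port. Port 8157 is an assumption here: use whatever number follows `-D` in the command you copied from the EMR dashboard. The helper name `port_is_open` is my own.

```python
import socket


def port_is_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # Assumed port; replace with the number after -D in your tunnel command.
    tunnel_port = 8157
    if port_is_open(tunnel_port):
        print("tunnel is listening on port", tunnel_port)
    else:
        print("nothing is listening on port", tunnel_port)
```

If nothing is listening, re-run the tunnel command and check the path to your key before troubleshooting FoxyProxy.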
Step 8: Set Anaconda As Default Python Interpreter In Zeppelin
- Click `anonymous` in top right corner.
- Click `Interpreter`.
- Scroll down to the python interpreter.
- Click `Edit`.
- Locate zeppelin.python.
- Set value to /home/hadoop/anaconda/bin/python
- Now find the spark interpreter.
- Locate zeppelin.pyspark.python.
- Set value to /home/hadoop/anaconda/bin/python
- That’s it! On to the next section.
Note: You can check that Anaconda is configured correctly as default by opening a new note and typing:
%python
import sys
print(sys.version)
The output should read something like:
3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
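The version string tells you which Python is running but not where it lives. As a complementary check, the sketch below looks at the interpreter path itself. It assumes Anaconda was installed under /home/hadoop/anaconda, as configured above, and the helper name `runs_from` is hypothetical. In Zeppelin you would put `%python` on the first line of the paragraph.

```python
import sys

# Install location used in this tutorial's interpreter settings.
ANACONDA_PREFIX = "/home/hadoop/anaconda"


def runs_from(prefix, executable=None):
    """True if the interpreter path lives under the given install prefix."""
    executable = sys.executable if executable is None else executable
    return executable.startswith(prefix.rstrip("/") + "/")


print(sys.executable)
print("Anaconda in use:", runs_from(ANACONDA_PREFIX))
```

If the second line prints False, recheck the zeppelin.python and zeppelin.pyspark.python values and restart the interpreter.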
Step 9: Setup Shiro Authentication in Zeppelin
- In EMR Terminal window, navigate to /usr/lib/zeppelin/conf.
- We need to copy two templates:
  - type `$ sudo cp shiro.ini.template shiro.ini`
  - type `$ sudo cp zeppelin-site.xml.template zeppelin-site.xml`
- Secure the HTTP channel
  - type `$ sudo nano shiro.ini`
  - Scroll down to the [urls] section
  - Make sure this is set like so and save changes:
    #/api/version = anon
    /api/interpreter/** = authc, roles[admin]
    /api/configurations/** = authc, roles[admin]
    /api/credential/** = authc, roles[admin]
    #/** = anon
    /** = authc
- Secure Websocket channel
  - type `$ sudo nano zeppelin-site.xml`
  - locate zeppelin.anonymous.allowed
  - set its value to false
  - simultaneously type `control o` (to save changes)
  - hit `enter`
  - simultaneously type `control x` (to exit)
- Navigate to Zeppelin directory by typing `$ cd ..`
- Type `$ sudo bin/zeppelin-daemon.sh restart`
- Go back to Zeppelin
- Refresh the page. You should see Login towards the top right with a green dot to the left of it.
- Click `Login`.
- Use any of these username, password combos:
admin password1
user1 password2
user2 password3
Note 1: usernames, passwords, and groups can be set up in the shiro.ini file.
Note 2: note permissions (owners, writers, readers) can be set within note by clicking lock icon towards top right.
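A misplaced `#` in the [urls] section is easy to miss in nano. The sketch below is one way to double-check which rules are actually active. It is a simplified reader written for this tutorial, not Shiro's own parser: it treats lines starting with `#` or `;` as comments and only looks at the [urls] section. The function name `active_url_rules` and the `SAMPLE` text are mine; in practice you would read /usr/lib/zeppelin/conf/shiro.ini instead.

```python
def active_url_rules(ini_text):
    """Return {url pattern: filter chain} for uncommented lines in the [urls] section."""
    rules = {}
    in_urls = False
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            in_urls = line == "[urls]"
            continue
        if in_urls and "=" in line:
            pattern, chain = line.split("=", 1)
            rules[pattern.strip()] = chain.strip()
    return rules


# Sample text mirroring the [urls] settings shown in this step.
SAMPLE = """\
[urls]
#/api/version = anon
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
#/** = anon
/** = authc
"""

rules = active_url_rules(SAMPLE)
for pattern, chain in rules.items():
    print(pattern, "=>", chain)
print("catch-all requires login:", rules.get("/**") == "authc")
```

The catch-all `/**` rule should map to `authc`; if it maps to `anon`, anyone who can reach the notebook can use it without logging in.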
Step 10: Setup Zepl (formerly ZeppelinHub)
- Go to Zepl
- Click the blue `Sign Up` button.
- Supply Username, e-mail, password and click the blue `Create Account` button.
- Click the `New` button towards top right of screen.
- Select `Repository`.
- Give it a name and description.
- Click the blue `Link` button.
- A new window pops up with key information you’ll need to set environment variables.
- Let’s set those variables now. Go to EMR Terminal window and connect via SSH, if you haven’t already and type:
$ cd /usr/lib/zeppelin/conf
$ nano zeppelin-env.sh
- Follow the instructions on Zepl for correctly updating zeppelin-env.sh. At the time of this writing, the updates looked like this:
export ZEPPELIN_NOTEBOOK_STORAGE="org.apache.zeppelin.notebook.repo.GitNotebookRepo, org.apache.zeppelin$
export ZEPPELINHUB_API_ADDRESS="https://www.zepl.com"
export ZEPPELINHUB_API_TOKEN="INSERT YOUR TOKEN HERE"
- Navigate to zeppelin directory by typing
$ cd ..
- Type `$ sudo bin/zeppelin-daemon.sh restart`
- To connect your Zeppelin notebooks and Zepl, simply create or open a notebook, run some code, and then that notebook will load automatically.
- Congrats! You are all done.
WARNING!
Make sure you Terminate your cluster when you are done so you do not incur additional charges. The nice part is that the next time you want to spin up a similar cluster, you can click `Clone` and most of the work is already done for you. Enjoy!
That’s all for now. I hope you found this tutorial helpful.