Installing Impala 2.3 on Amazon EMR

I see that Impala 2.3 is only supported on Cloudera CDH 5.5 & above. Impala 2.2 can be installed on Amazon EMR as there is Bootstrap script available on GitHub & you don't require Cloudera installation.

However, I don't see any way to install Cloudera CDH 5.5 or 5.6 on Amazon EMR. I want to install Impala 2.3 so is there any way through which Impala 2.3 can be installed on Amazon EMR?

Answers


Well, my previous answer has been deleted as long as "does not provide an answer to the question". I'm not going to argue if it's better to have a partially incorrect answer to this question or if making categorical claims without foundation is a good answer :/.

In any case, I'm not giving up :)

Yes, it's possible to install "anything" on the paper.

Once you launch the EMR cluster, all instances will appear on your EC2 console. The only thing is that you have to be careful assigning the right permissions to access thru SSH to your instances. My suggestion is to create a specific security group with the access and assign this extra security group to the instances using the Advanced configuration of the cluster. By having the proper configuration, you could ssh into any instance and install anything (you should be able to scp any file or download from internet if you have the proper configuration of your VPC). Note that the user will be "hadoop" instead "ec2-root" but this is documented on the EMR user guide.

Keep in mind that the cluster is "Terminated" so, the EMR instances are volatile and the installation is not going to survive the cluster termination.

On the other hand, using the latest versions of EMR AMIs and the latest capabilities of AWS (I think that it was all the time the case, but, it doesn't matter now) you should be able to create some actions on the bootstrap and install anything you want.

Using the "Advanced configuration" of your cluster, you can access to the "Bootstrap" actions to be executed on your cluster. You could even have different actions depending on the node type (master, core, tasks). You should store your scripts (and/or jar files) on an S3 bucket and made this bucket available to your cluster. On the paper, you could install Impala on these EC2 instances comprising the EMR cluster but I'm not sure if this will work.

For more information, you can read http://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-plan-bootstrap.html

And for a previous version of EMR AMI and not so recent version of Impala you can read https://github.com/awslabs/emr-bootstrap-actions/tree/master/impala

Thanks Mark, you forced me to elaborate better my comment.


No, it is not possible to "install" anything on EMR because it's a PaaS provided by AWS. But if your goal is to run a newer version of Impala on AWS, there is an AWS Quick Start path for installing CDH 5.x (including Impala) that makes the process relatively easy.

http://aws.amazon.com/quickstart/


Need Your Help