I have a web application that interacts with Hadoop (Cloudera cdh3u6). A particular user operation should launch a new MapReduce job on the cluster.
The cluster is not a secure cluster, but it uses simple group authentication - so if I ssh to it as myself, I can launch MR jobs from the command line.
In the web application, I'm using the ToolRunner to run my job:
MyMapReduceWrapperClass mr = new MyMapReduceWrapperClass();
ToolRunner.run(mr, null);
// inside the run implementation of my wrapper class :
Job job = new Job(conf, "job title");
//set up stuff removed
job.submit();
Currently this job is submitted as the user that launched the web application server (Tomcat) process, and that user is a special local account on this web server that doesn't have permissions to send jobs to the cluster.
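For what it's worth, a quick way to confirm which account that is from inside the web app is sketched below. ProcessUser is a hypothetical helper name, and I am assuming (not certain) that with simple authentication the Hadoop client takes the submitting identity from the OS account that owns the JVM, which the user.name system property normally reflects unless it was overridden with -Duser.name.

```java
public class ProcessUser {
    // Returns the account name the JVM reports for the current process.
    // Assumption: this matches the identity the Hadoop client would use
    // under simple (non-Kerberos) authentication.
    static String current() {
        return System.getProperty("user.name");
    }

    public static void main(String[] args) {
        System.out.println("JVM is running as: " + current());
    }
}
```

Calling ProcessUser.current() from a servlet or from the Tool's run method should report the Tomcat account if the assumption above holds.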
Ideally I'd like to be able to get some kind of identity from the user and pass it along, so that as different users interact with the web app / service we can see who invoked which jobs. Setting aside how to actually coordinate those credential services, I'm not even clear on where the identity would go.
I see that a Job has a getCredentials() method, but from reading about the token / Kerberos material there I have the impression that this is for secured clusters (which I think ours is not). I also don't think my web server has Kerberos installed, though that could be fixed. It also sounds like the intended use case is to add secrets that a MapReduce job might need while running to access other services, not to run the job as someone else.
I also see that the (older?) JobConf class has a setUser(String name) method, which seems promising, even though I don't know whether it would require a password or something, but I can't find much information or documentation on that function. I tried it out and it had no impact: the job was still submitted as the Tomcat user.
Are there other avenues to explore or research? I am out of keywords to Google. I would prefer not to be told "Just give your Tomcat user permissions on the cluster": I don't manage that asset and I don't expect that request to fly. If, however, that really is my only option, I'd like to understand why, so that I can argue the need with the right information.