Pig on HDInsight Server

@Slodge prompted me to look in more detail at running Pig on HDInsight.

I’ve played with it in the past using the interactive JavaScript console and the word count sample, but haven’t gone into much more detail, so I thought it would be nice to do something off the back of my ‘fancy’ aviation weather report using Pig and Hadoop on Windows.

In that previous post I described how I took semi-structured METAR reports, ran an M/R program on them to extract the cloudbase and temperature, and then created a Hive table on top. In this post I’ll use some basic Pig to examine the data in the Hive table and extract the 10 reports with the highest cloudbase.

To get started, I open a Hadoop command shell and browse to c:\Hadoop\pig-0.9.3-SNAPSHOT\bin.

I then run pig.cmd, which takes me to the grunt> prompt.

image

To start with, I’ll simply read the contents of the table (as it’s not too big at this point):

grunt> everything = LOAD 'metarsoutput';

grunt>dump everything;

and I get a bunch of results, here’s an extract –

image

To work with the results more easily, I can provide details of the schema –

grunt> everything = LOAD 'metarsoutput' as (icao, datetime, cloudbase: int, temperature: int);

grunt> describe everything;

produces –

image

whilst

dump everything;

still produces the same results as before, but now I can ask for the records to be sorted –

grunt> sorted = order everything by cloudbase desc;

grunt> dump sorted;

which produces nicely ordered results –

image

I can also limit the number of records I want to get back -

grunt> top = limit sorted 10;

grunt> dump top;

image
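For reference, the interactive steps above can be collected into a single script. This is only a sketch: the file name top10.pig and the alias top10 are my own placeholders, and it assumes, as in the session above, that 'metarsoutput' can be read with Pig's default loader.

```pig
-- top10.pig (hypothetical file name): the grunt session above as one script

-- Load the table with a schema so that cloudbase is treated as an int
everything = LOAD 'metarsoutput' AS (icao, datetime, cloudbase:int, temperature:int);

-- Sort by cloudbase, highest first
sorted = ORDER everything BY cloudbase DESC;

-- Keep only the first 10 records of the sorted relation
top10 = LIMIT sorted 10;

DUMP top10;
```

Pig accepts a script file as a command-line argument, so something like pig.cmd top10.pig should run this outside the grunt shell, although I have not verified that on this particular build.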


OK, so these are all pretty basic examples, but they do show the basic operation of Pig on HDInsight.

To find out more about what’s possible with Pig, take a look here

@Slodge actually asked me about being able to run custom functions (UDFs) for Pig in C#, which is not currently possible. Pig does support streaming, though, and that should provide a handy way ‘in’, which I’ll try to look at next.
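As a rough idea of what the streaming route might look like, here is an untested sketch using Pig's STREAM operator. MetarFilter.exe is a purely hypothetical name for a C# console application that reads tab-separated records from standard input and writes tab-separated records to standard output; the alias names are placeholders too.

```pig
-- Hypothetical: MetarFilter.exe stands in for a C# console app that
-- reads records on stdin and writes records on stdout
DEFINE metarfilter `MetarFilter.exe`;

everything = LOAD 'metarsoutput' AS (icao, datetime, cloudbase:int, temperature:int);

-- Pipe every record through the external program and apply a schema to its output
filtered = STREAM everything THROUGH metarfilter AS (icao, datetime, cloudbase:int, temperature:int);

DUMP filtered;
```

The executable would need to be available on every node that runs the job; DEFINE has a SHIP clause for sending it to the cluster, but I have not checked how that behaves with Windows paths on HDInsight.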


Written by Yossi Dahan at 20:13
