Wednesday, October 26, 2011

ECL - A Practical Language for a Practical Data Programmer

Let us say you have a list of objects that contain Person information such as First Name, Last Name, Phone Number, Date of Birth and Address. The task is to transform each object into a new object that contains a Name field, Phone Number, Date of Birth and Address, where the Name field is the full name of the person and the remaining fields are copies of the original object's corresponding field values.

If we attempt to write the solution in Java, the function to perform the conversion for each Person object will look something like:

public PersonEx convert(Person person) { 
 PersonEx personEx = new PersonEx(); 
 personEx.name = person.firstName + " " + person.lastName; 
 personEx.phoneNumber = person.phoneNumber; 
 personEx.address = person.address; 
 personEx.dateOfBirth = person.dateOfBirth; 
 return personEx; 
}

Now, the equivalent code in ECL is:

PersonEx convert(Person person) := TRANSFORM 
 SELF.name := person.firstName + ' ' + person.lastName;
 SELF := person; 
end;

The ECL code looks much simpler but achieves the same objective. Let me explain. ECL automates programming steps as much as possible based on the information that is available to the compiler. In the above example, ECL knows that the return value is a record of type "PersonEx". Hence, the keyword "SELF" refers to that result record, much like writing "PersonEx personEx = new PersonEx();" in Java and then assigning to its fields. The instance creation and association are implicit, which eliminates the boilerplate and greatly simplifies your programming task.

There is some more implicit magic here. What does the statement "SELF := person;" do? You guessed right. It is equivalent to explicitly writing the following code:

SELF.phoneNumber := person.phoneNumber;
SELF.address := person.address;
SELF.dateOfBirth := person.dateOfBirth;


By introspecting the two record types, ECL compares their fields and automatically assigns every field with a matching name that has not already been assigned.

To summarize:

  • SELF refers to the result record being built; the instance creation and association are implicit.
  • SELF := person; copies every field with a matching name that has not already been assigned.
  • ECL automates these mechanical steps based on the information available to the compiler, so you write only the logic that actually differs.

ECL provides us with many more features to make our programming lives easier. The following are some examples:

PROJECT(persons, convert(LEFT));

The "LEFT" indicates a reference to a record in "persons". The "project" declares that for every record in the data set "persons", perform the "convert" transformation. Implementing something similar in java would need an iterator, loop and several variable initialization steps.

OUTPUT(persons(firstName = 'Jason'));

This action returns the records of "persons" that satisfy the filter firstName = 'Jason'. Filters like this can also be stored, combined and reused, as sketched below.
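
As a further hedged illustration, reusing the assumed persons dataset from the sketch above, a filter expression can be kept in a definition, combined with boolean operators and fed into other operations:

//A filter produces a new record set; like any definition, it is evaluated lazily
jasons := persons(firstName = 'Jason');

//Count how many records matched the filter
OUTPUT(COUNT(jasons));

//Filters can be combined with AND / OR
OUTPUT(persons(firstName = 'Jason' AND lastName = 'Smith'));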

As you can see in the above examples, ECL has been designed to be simple and practical. It enables data programmers to quickly turn their ideas into working programs by keeping the syntax simple and minimal. Java is used in the examples to contrast the two styles and to show how ECL's compact style can be used effectively to solve data processing problems. This does not mean that Java is not a practical language; it simply means that ECL is abstract enough to shield the programmer from the more complex language structures that Java requires.


Tuesday, October 18, 2011

Parsing a Web Server Log File on Thor

One of the major benefits of Thor, the HPCC Systems data refinery component, is its ability to process large volumes of data, such as log file data, without the need to perform pre-processing outside of the platform. In this blog entry you will be introduced to ECL's powerful regular expression library, which makes data extraction look easy. ECL is the programming language used to program ETL (Extract, Transform and Load) jobs on Thor.

Understanding the Input Content

Typical log file data can be classified as semi-structured content: most log files have records and columns that can be extracted, in contrast to unstructured data, where record and column boundaries are hard to identify.

A web server log file is an example of a semi-structured file. A typical line in a web server log file that conforms to the standard common log format reads as:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

ECL, the powerful distributed data programming language, has built-in libraries to process both semi-structured data (log files, etc.) and unstructured data (emails, HTML content, Twitter feeds, etc.). This enables you to maximize the parallel processing capability of the platform right out of the gate. No holding back.

Designing for Extraction

Let us identify the tokens that are present in a web log file. This will enable us to code a token parser to successfully extract web server log data from a file that implements the common log format. Reading the sample line above from left to right, the tokens are:

  • the client IP address (127.0.0.1)
  • the identity field, which is typically a hyphen (-)
  • the authenticated user id (frank)
  • the date, time and timezone offset, enclosed in square brackets ([10/Oct/2000:13:55:36 -0700])
  • the request line, enclosed in double quotes ("GET /apache_pb.gif HTTP/1.0")
  • the HTTP status code (200)
  • the number of bytes returned (2326)

Writing ECL Code

We will now write an ECL program to parse a file that contains lines of data where each line is formatted as shown above.

Step 1 - Declare record structure to store the raw input

//Declare the record to store each record from
//the raw input file. Since the file has lines of log data,
//the record will need one string field to store each line.
RawLayout := record
string rawTxt;
end;

//Declare the file. In this example,
//for simplicity, the content is shown inline
fileRaw := dataset([{'127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'}], RawLayout);


Step 2 - Declare regular expression patterns to parse the tokens from each raw input record

Beware! This is where you would need to pick up a regular expressions book (or web site) if you are not already familiar with the topic.


pattern alphaFmt := pattern('[a-zA-Z]')+;
pattern alphaNumbFmt := pattern('[a-zA-Z0-9]')+;
pattern sepFmt := ' '+;
pattern numFmt := pattern('[0-9]')+;
pattern ipFmt := numFmt '.' numFmt '.' numFmt '.' numFmt;
pattern identFmt := '-';
pattern authuserFmt := alphaNumbFmt;
pattern hoursFromGMT := pattern('[\\-\\+]') numFmt;
pattern yearFmt := numFmt;
pattern monthFmt := alphaNumbFmt;
pattern dayFmt := numFmt;
pattern hoursFmt := numFmt;
pattern minutesFmt := numFmt;
pattern secondsFmt := numFmt;
pattern dateFmt := '[' dayFmt '/' monthFmt '/' yearFmt ':' hoursFmt ':' minutesFmt ':' secondsFmt ' ' hoursFromGMT ']';
pattern cmdFmt := alphaFmt;
pattern notQuoteFmt := pattern('[^"]')*;
pattern paramsFmt := opt('?' notQuoteFmt);
pattern urlFmt := pattern('[^"\\?]')*;
pattern httpMethodFmt := 'HTTP/' numFmt '.' numFmt;
pattern requestFmt := '"' cmdFmt urlFmt paramsFmt httpMethodFmt '"';
pattern statusFmt := numFmt;
pattern bytesFmt := numFmt;

pattern line := ipFmt sepFmt identFmt sepFmt authUserFmt sepFmt dateFmt sepFmt requestFmt sepFmt statusFmt sepFmt bytesFmt;


The declarations above are easy to follow if you know your regular expressions. These pattern declarations are used by the parser to extract tokens from the raw input record and map them to a (structured) model that can then be used to perform further processing.

The pattern line declaration specifies how each line in the file should be parsed and interpreted as tokens.

Step 3 - Declare the new record that will contain the extracted tokens


LogLayout := record
string ip := matchtext(ipFmt);
string authUser := matchtext(authuserFmt);
string date := matchtext(dateFmt);
string request := matchtext(requestFmt);
string status := matchtext(statusFmt);
string bytes := matchtext(bytesFmt);
end;


The matchtext() function is used to extract the specific token that you are interested in from the matched pattern.

Step 4 - Parse the file and output the result

logFile := parse(fileRaw,   //the dataset to parse
           rawTxt,          //the field within each record that holds the text
           line,            //the pattern describing a complete log line
           LogLayout,       //the output record layout
           first);          //return only the first match starting at a given position

output(logFile);


The parse function accepts the dataset to parse, the field in RawLayout to parse, the token pattern for each line, the output record layout and a flag as parameters. The flag value first indicates "Only return a row for the first match starting at a particular position".

Step 5 - Submit the program to Thor and view the results


Once the program is submitted (run), the output should show a single parsed record with the ip, authUser, date, request, status and bytes fields populated from the sample line.


There are a few variations of this program that you can implement.

Some Variations

What if the data source is actually a sprayed file?

fileRaw := dataset('~.::myfile',RawLayout,csv(separator('')));

You will simply replace the fileRaw declaration with the one above. "~.::myfile" is the logical name of the sprayed file.

How can you record error lines that do not match the specified pattern? In other words, how do you capture malformed input?

ErrorLayout := record
string t := fileRaw.rawTxt;
end;

e := parse(fileRaw,
rawTxt,
line,
ErrorLayout,NOT MATCHED ONLY);

output(e);

As you have seen, the ECL language and Thor provide you with a powerful framework to accomplish your ETL tasks. You can learn more about Thor and ECL at http://hpccsystems.com.

Wednesday, July 20, 2011

MapReduce Vs. HPCC Platform

Let us  assume that we have to solve the following problem:

"Find the number of times each URL is accessed by parsing a log file containing URL and date stamp pairs"

Solution 1

This is easy. Traverse the log file, one line at a time, and record the count for every unique URL that is encountered. This is easily accomplished in a Java program using a for loop and a HashMap.

Solution 1 works great if the input file is small. What if you are dealing with a large volume of data? Gigabytes, terabytes, petabytes and beyond. Sequential processing is not a practical option for dealing with Big Data.

Solution 2

In this solution, the input is split into multiple <key, value> pairs and fed to a map function in parallel. The map function then converts them into intermediate <key, value> pairs. In our example, the map function will output a <url, 1> pair for every input <key, value>, where url is the URL identified from the input and 1 marks a single occurrence.

Before Map Step:

Key = 1 Value = http://hpccsystems.com
Key = 2 Value = http://hpccsystems.com
Key = 3 Value = http://hpccsystems.com/developers
Key = 4 Value = http://hpccsystems.com/downloads

After Map Step:

Key = http://hpccsystems.com Value = 1
Key = http://hpccsystems.com Value = 1
Key = http://hpccsystems.com/developers Value = 1
Key = http://hpccsystems.com/downloads Value = 1

The data is then sorted by the intermediate keys (<url1, 1>, <url2, 1>, etc.) so that all occurrences of the same key are grouped together. Each unique key, together with all of its values, is then passed to a reduce function. The reduce function can be called multiple times for the same key. In our example, the reduce function simply sums the occurrences for each unique URL, producing <url, total count> pairs.

After Reduce Step:
 
Key = http://hpccsystems.com Value = 2
Key = http://hpccsystems.com/developers Value = 1
Key = http://hpccsystems.com/downloads Value = 1



 

This process of solving the problem by using the Map and Reduce steps is called MapReduce. The MapReduce paradigm was made famous by Google, which used it to process large volumes of crawled data by spreading the map and reduce jobs across several worker nodes in a cluster and executing them in parallel to achieve high throughput. Another well-known MapReduce framework is Hadoop.

The major drawback of the MapReduce paradigm is that it is intended to process batch-oriented jobs. It is suitable for ETL (Extraction, Transformation and Loading) but not for online query processing. Hadoop and its extensions, Hive and HBase, are therefore built on top of a batch-oriented framework that is not really meant for online query processing. The other drawback is that every task needs to be defined in terms of a Map and a Reduce step so that work can be distributed across nodes. In most cases it takes several Map and Reduce steps to solve a single problem.


Solution 3

SELECT url, count(*) FROM urllog GROUP BY url

How easy is that? No map or reduce logic. SQL, with its declarative nature, lets us concentrate on the What rather than the How. However, SQL databases do not lend themselves well to Big Data processing, which typically involves several clustered nodes working together to produce an end result.

HPCC Systems ECL is specifically designed to overcome the limitations of SQL. ECL is a truly declarative language, somewhat similar to SQL, that lets you solve the problem by expressing the What rather than the How. The complexity around clustering is well encapsulated by the ECL language and hence is never really exposed to the programmer.


Simple ECL Code to find the count for each unique URL:
//Declare the input record structure. Assume an input CSV file of URL,Date fields
rec := RECORD
  STRING50 url;
  STRING50 date;
END;    

//Declare the source of the data
urllog := DATASET('~tutorial::AC::Urllog',rec,CSV);

//Declare the record structure to hold the aggregates for each unique URL
grouprec := record
  urllog.url;
  groupCount := COUNT(GROUP);
end;

//The TABLE function is equivalent to an SQL SELECT command
//The following declaration is used to create a Cross Tab aggregate of the SELECT equivalent shown above
RepTable := TABLE(urllog,grouprec,url);

//Output the new record set
OUTPUT(RepTable);

Sample Input:
 

Sample Output:

The HPCC Platform will distribute the work across the nodes based on the optimal path determined at runtime.

As compared to MapReduce frameworks like Hadoop, the HPCC Platform keeps it simple: let the platform determine the best distribution of work across nodes so that the developer solves the What rather than worrying about the How. Further, the HPCC Platform comprises two components that are optimized to solve specific problems: Thor is used as the ETL (Extraction, Transformation, Loading and Linking) engine, and Roxie is used as the online query processing engine.

Monday, July 11, 2011

ECL - Part III (Declarative, Attributes (aka Definitions) and Actions)

ECL is a declarative programming language. In declarative programming, we express the logic of a computation without describing its control flow or state. For example, in Java you would write:

x = 1;
y = 2;
System.out.println(x+y);

Java is an imperative programming language, where the sequence of steps you write dictates the order in which the code is executed. In the example, x = 1 is executed first, y = 2 next, and then System.out.println. Here, the programmer controls the sequence and the state.

For the same code, in ECL you would:

x:= 1; //An attribute declaration
y:= 2;
output(x+y); //An action

The code looks similar to the Java code. The difference is that the steps x := 1 and y := 2 neither perform a state assignment nor define the control flow. Each simply means: when there is a need to use x, substitute the value 1; until there is a need, do nothing.


Well, you might now ask the question - Is being a declarative programming language really important?

The answer is "YES". In parallel computing, it is best left to the platform to determine the optimal sequence of steps (serial or parallel) to execute to reach an end goal. Performance and scale are everything.

Attributes and Actions

  • Attribute (aka Definition): A declaration such as x := 1; is an attribute declaration, where x is an attribute. It is important to note that an attribute is not a variable state assignment but a declaration; a declaration postpones evaluation until the attribute is actually used.
  • Action: An action such as output(x+y); instructs the platform to execute a code snippet that is expected to produce a result.
In ECL, every statement is either an attribute (aka definition) or an action. Declarative thinking helps Big Data developers concentrate on the problem solution (the What) rather than on the sequence of steps, parallel programming techniques and state assignment (the How). Being declarative is another reason why ECL is a powerful language for Big Data programming. A small sketch of this behavior follows.
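
As a minimal sketch of this idea (the dataset and field name below are hypothetical, not taken from the earlier examples), the following shows that definitions are only evaluated when an action actually needs them:

//A small inline dataset for illustration
nums := DATASET([{1}, {2}, {3}, {4}], {UNSIGNED n});

//Attribute declarations: nothing is computed yet
expensive := COUNT(nums(n > 100)); //never referenced by an action, so never evaluated
total := SUM(nums, nums.n);        //evaluated only because the action below needs it

//The action: this is what triggers evaluation
OUTPUT(total); //10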

Friday, July 08, 2011

ECL - Part II (ECL IDE Basics and Transformations)

In Part I of the ECL blog series we were introduced to the HPCC platform and learned how to load a data file and display its contents using ECL. In Part II, we will continue from where we left off and learn about transformations in ECL. This will give you a glimpse of the power of the ECL language and why it is the best language for handling data manipulation, Big or Small.

Before we begin to code transformations, let us spend some time understanding the features/views available in the ECL IDE, the tool used to write ECL code:



  • Builder - Use the builder to edit your ECL code, build it and submit it for execution.
  • Submit/Compile - Used to compile an ECL code file and submit it as a job for execution on the cluster.
  • Output Results - The results of executed ECL code can be viewed here.
  • Syntax Errors - Check that your ECL code is free of syntax errors using the compile option (F7). The Syntax Errors view displays design-time syntax errors.
  • Runtime Errors - The error log view displays the errors that occur when ECL code is executed on the cluster.
  • Workunits - Displays all the ECL jobs that have been executed on a cluster, conveniently categorized by day, month and year.
  • Repository - This is synonymous with projects in other IDEs. It shows the location of files on local storage; for me, they can be found on the hard disk at "C:\Users\Public\Documents\HPCC Systems". It can be configured to point elsewhere by changing the IDE preferences.
  • Workspace - A logical work environment that can be used to enhance your programming experience.
  • Datasets - Lists the available data sets on the cluster. It is convenient to select a data set and copy its label for use in your code.
Read more about the ECL IDE and Client Tools here

Now back to coding transformations.  For the transformation example, we are going to work with the OriginalPerson dataset from Part I and transform the data to create a new TransformedPerson dataset, which is a copy of the OriginalPerson dataset with the First, Middle and Last names converted to upper case.

Open a new builder window (CTRL+N) and type in the following code: 

IMPORT Std;
//Declare the format of the source and destination record
Layout_People := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5 Zip;
  STRING42 Street;
  STRING20 City;
  STRING2 State;
END;


//Declare reference to source file
File_OriginalPerson :=
DATASET('~tutorial::AC::OriginalPerson',Layout_People,THOR); 


//Write the Transform code
Layout_People toUpperPlease(Layout_People pInput)
:= TRANSFORM
  SELF.FirstName := Std.Str.ToUpperCase(pInput.FirstName);
  SELF.LastName := Std.Str.ToUpperCase(pInput.LastName);
  SELF.MiddleName := Std.Str.ToUpperCase(pInput.MiddleName);
  SELF.Zip := pInput.Zip;
  SELF.Street := pInput.Street;
  SELF.City := pInput.City;
  SELF.State := pInput.State;
END ; 


//Apply the transformation
TransformedPersonDataset := PROJECT(File_OriginalPerson, toUpperPlease(LEFT));

//Output it as a new Dataset
OUTPUT(TransformedPersonDataset,,'~tutorial::AC::TransformedPerson', OVERWRITE);


The important step is the call to the PROJECT function. In this particular case it means:

"Transform the dataset File_OriginalPerson into TransformedPersonDataset by applying the transformation toUpperPlease to each record of the LEFT dataset, File_OriginalPerson."

LEFT is analogous to the LEFT side of a join in SQL syntax; in this case it refers to File_OriginalPerson.

Compile and Submit the code. View the results in the Output Results view.




This is some powerful code. ECL lets you solve complex data manipulation problems using simple and concise code. This is only the tip of the iceberg. Read the ECL Programmer's Guide and the ECL Language Reference to discover ECL's immense power.

Wednesday, July 06, 2011

ECL - Part I (loading data)

One of the cool features of the HPCC platform is its ability to extract, transform and load Big Data (terabytes to petabytes). At the core of the HPCC platform is the powerful and simple ECL (Enterprise Control Language) programming language. Part I of the ECL blog series shows you how to load data and display its contents; in essence, how to get the data ready for further manipulation.

Before we begin to get our feet wet and drool over the code, it is imperative to spend a few minutes understanding the HPCC platform architecture. The following high-level architecture diagram provides you with a view of the important components:



THOR - The Thor cluster performs the Loading, Extraction and Transformation of the data. It is used to load Big Data (unstructured, semi-structured or structured), transform it and optimize it for querying.

ROXIE - The Roxie cluster is optimized to perform queries with very high concurrency. Typically, processed data from a Thor cluster is exported to a Roxie cluster to enable real time, fast and highly concurrent query processing.

ESP - The ESP provides you with a simple web services interface that is used to access the Roxie queries.

Now back to the coding. How do we load the data? The HPCC Platform has a built-in utility called ECL Watch. Data loading is one of the many functions that ECL Watch provides. The following step-by-step tutorial takes you through the process:

1) For the sake of sanity, we will assume that you have been able to download and install the HPCC VM. If not, please proceed to do so and read the HPCC VM Install Guide.

2) Now download the sample data file containing person information (first name, last name, etc.). Once downloaded, extract the contents of the zip file and store them at a known location.

3) Point your browser to the ECL Watch and use the upload/download file link to upload the file to the landing zone



Browse to the sample person file you downloaded, select and upload.

4) Spray the file contents to all the nodes across the cluster. Again, use the ECL Watch utility to do this.




The label is a logical name; AC stands for my initials, and it really can be anything. Choose the file that you uploaded in step 3, set the record size to 124 (the sum of the fixed-width fields in the person record: 15 + 25 + 15 + 5 + 42 + 20 + 2) and submit it to be sprayed.

If successful, you should see a results page that looks like:



Click on the View Progress to view the progress of the spray

5) Download, install and configure the ECL IDE if you have not done so already.

ECL IDE preferences:

Enter your VM IP in the "Server" input.



Save the preference

6) Now, you are ready to write ECL Code. While in the ECL IDE, press CTRL+N to open a new work unit. Type in the following code:

Layout_People := RECORD
STRING15 FirstName;
STRING25 LastName;
STRING15 MiddleName;
STRING5 Zip;
STRING42 Street;
STRING20 City;
STRING2 State;
END;


File_OriginalPerson :=
DATASET('~tutorial::AC::OriginalPerson', Layout_People, THOR);

//Here change the AC to whatever you used to name the label in step 4

OUTPUT(File_OriginalPerson);




7) Syntax check (F7) and Submit/Execute the code to see the following results





That is it for now. In my next blog post, we will look at some of the features of the ECL IDE and the ECL language. In the process, we will also expand upon the person example and learn about transformations, indexing and sorting.