Arjuna's Blog

Monday, December 05, 2016

ECL as a Data Flow Language

The HPCC Systems platform provides two levels of parallelism:

1. Data partitioned parallel processing - Here, data is partitioned into parts and the parts are distributed across multiple slave nodes. Data partitioning makes it possible to execute an ECL operation simultaneously on every slave node, with each slave node operating on its own data parts.

2. Data flow parallel processing - Here, the ECL program is compiled to an optimized data flow graph (aka Explain plan, aka Execution Plan), with each node representing an operation.

For example, the following ECL code:

is compiled to the following data flow graph representation:

The compiler is able to understand (shown by the split above) that the individual SORT operations have no dependency on each other, and can be executed in parallel.

In contrast, let us consider a situation where one ECL command depends on another ECL command.

Here, the DEDUP operation depends on the output of the SORT operation. The ECL compiler automatically generates the correct data flow graph:

The data flow architecture has been around for a long time. In fact, most RDBMS SQL engines are based on a data flow architecture. That said, using a data flow architecture in a distributed data processing environment makes for a very powerful solution. Another great example of a data flow engine is Google's TensorFlow. With ECL, HPCC Systems provides a simple interface to program with, while delegating the complexity of parallel processing to the data flow architecture of the ECL compiler. The ECL compiler maximizes the use of the available computational power (CPUs, GPUs etc.) by deploying the ECL computations for parallel processing.
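The two situations described in this post (SORT operations that are independent of each other, and a DEDUP that depends on a SORT) can be mimicked in plain Java to get a feel for what a data flow graph expresses. This is only a rough analogy, not how the ECL compiler or Thor works internally: the class name DataFlowSketch, the helper methods and the sample data are all made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; none of these names come from HPCC Systems.
class DataFlowSketch {

    // Returns a sorted copy so that the input list is never mutated.
    static List<String> sort(List<String> rows) {
        List<String> copy = new ArrayList<>(rows);
        Collections.sort(copy);
        return copy;
    }

    // Removes duplicates while keeping the first occurrence of each value.
    static List<String> dedup(List<String> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }

    public static void main(String[] args) {
        List<String> names = List.of("smith", "jones", "smith", "adams");
        List<String> cities = List.of("paris", "austin", "london");

        // Independent nodes: neither sort needs the other's output,
        // so both may run at the same time.
        CompletableFuture<List<String>> sortedNames =
                CompletableFuture.supplyAsync(() -> sort(names));
        CompletableFuture<List<String>> sortedCities =
                CompletableFuture.supplyAsync(() -> sort(cities));

        // Dependent node: dedup is chained to the sort and can only
        // start once the sorted names are available.
        CompletableFuture<List<String>> dedupedNames =
                sortedNames.thenApply(DataFlowSketch::dedup);

        System.out.println(sortedCities.join());
        System.out.println(dedupedNames.join());
    }
}
```

In Java the programmer has to wire the dependencies by hand. In ECL the compiler derives them from the definitions.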

Saturday, October 06, 2012

ECL for dummies

If you are new to ECL but have a strong SQL background, learning ECL should be trivial. I dedicate this post to showing you code snippets of ECL and the equivalent SQL. This should help you quickly get started with ECL.

Defining a Dataset

Query Processing and Extraction

Transformations

Browse the ECL Language Reference for further reading and detailed descriptions of the ECL functions used above.

Friday, April 20, 2012

Dynamic ECL

One of the powerful features of the HPCC Systems platform is its ability to dynamically execute ECL code. Think of it as the ability to run a script (like SQL, JavaScript, Groovy etc.) on the fly in a production system. The platform provides you with this capability via the ECL Direct services interface.

The ECL Direct service can be invoked via SOAP or plain HTTP. For example, we could write code in Java to execute ECL on a remote cluster in the following fashion:

The example uses the Open Source HPCC Systems platform running within a Virtual Machine. If the security module is enabled, credentials must be passed as well.

How is this feature useful? In general, this feature allows ECL to be passed to HPCC clusters from within other programming languages, similar to the way programmers currently interface with traditional SQL database servers. The following are a few specific cases where the ECL Direct interface serves a purpose:

For integrating HPCC with ETL tools like Pentaho, Informatica etc
For building query interfaces like JDBC and ODBC
Online dynamic query applications
Building IDE and Debugging tools for ECL
and more....

As you can see, the HPCC Systems platform provides us with a great amount of flexibility with Dynamic ECL. The ability to write and submit ECL to a live environment without the need to compile and package code makes it easy for developers to build powerful integrations.

Wednesday, October 26, 2011

ECL - A Practical Language for a Practical Data Programmer

Let us say you have a list of objects that contain Person information like First Name, Last Name, Phone Number, Date of Birth and Address. Each object has to be transformed into a new object that contains a Name field, Phone Number, Date of Birth and Address, where the Name field is the full name of the person and the rest of the fields are copies of the original object's corresponding field values.

If we attempt to write the solution in Java, the function to perform the conversion for each Person object will look something like:

public PersonEx convert(Person person) {
    PersonEx personEx = new PersonEx();
    // Full name: first name followed by last name, as in the ECL version
    personEx.name = person.firstName + " " + person.lastName;
    personEx.phoneNumber = person.phoneNumber;
    personEx.address = person.address;
    personEx.dateOfBirth = person.dateOfBirth;
    return personEx;
}

Now, the equivalent code in ECL is:

PersonEx convert(Person person) := TRANSFORM
    SELF.name := person.firstName + ' ' + person.lastName;
    SELF := person;
END;

The ECL code looks much simpler but achieves the same objective. Let me explain. ECL automates programming steps as much as possible based on the information that is available to the compiler. In the above example, ECL knows that the return value is a record of type "PersonEx". Hence, the keyword "SELF" plays the role of "PersonEx personEx = new PersonEx();" in Java. The instance creation and association are implicit. This eliminates the need to type in all the extra code, which greatly simplifies your programming task.

There is some more implicit magic here. What does the statement "SELF := person;" do? You guessed right. It is equivalent to explicitly writing the following code:

SELF.phoneNumber := person.phoneNumber;
SELF.address := person.address;
SELF.dateOfBirth := person.dateOfBirth;

ECL, by introspection, compares the two records and automatically assigns every field that has the same name in both and has not already been assigned.

To summarize:

ECL provides many more features to make our programming lives easier. The following are some examples:

PROJECT(persons, convert(LEFT));

"LEFT" is a reference to the current record in "persons". "PROJECT" declares that the "convert" transformation is to be performed for every record in the data set "persons". Implementing something similar in Java would need an iterator, a loop and several variable initialization steps.

OUTPUT(persons(firstName = 'Jason'));

This action returns the records of "persons" that satisfy the filter "firstName = 'Jason'".
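For comparison, here is roughly what the PROJECT and the filter above would look like when written out by hand in Java. It is a minimal sketch: the Person and PersonEx classes are stand-ins built from the fields named in this post, and the class name ProjectSketch and the method names project and filterByFirstName are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in classes; only the field names come from the post.
class ProjectSketch {

    static class Person {
        String firstName, lastName, phoneNumber, dateOfBirth, address;

        Person(String firstName, String lastName, String phoneNumber,
               String dateOfBirth, String address) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.phoneNumber = phoneNumber;
            this.dateOfBirth = dateOfBirth;
            this.address = address;
        }
    }

    static class PersonEx {
        String name, phoneNumber, dateOfBirth, address;
    }

    // The per-record conversion: every field is copied explicitly.
    static PersonEx convert(Person person) {
        PersonEx personEx = new PersonEx();
        personEx.name = person.firstName + " " + person.lastName;
        personEx.phoneNumber = person.phoneNumber;
        personEx.address = person.address;
        personEx.dateOfBirth = person.dateOfBirth;
        return personEx;
    }

    // Roughly what PROJECT(persons, convert(LEFT)) expresses.
    static List<PersonEx> project(List<Person> persons) {
        List<PersonEx> result = new ArrayList<>();
        for (Person person : persons) {
            result.add(convert(person));
        }
        return result;
    }

    // Roughly what the filter persons(firstName = 'Jason') expresses.
    static List<Person> filterByFirstName(List<Person> persons, String firstName) {
        List<Person> result = new ArrayList<>();
        for (Person person : persons) {
            if (person.firstName.equals(firstName)) {
                result.add(person);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Person> persons = List.of(
                new Person("Jason", "Lee", "555-0100", "1980-01-01", "1 Main St"),
                new Person("Ann", "Ray", "555-0101", "1990-02-02", "2 Oak Ave"));
        System.out.println(project(persons).get(0).name);
        System.out.println(filterByFirstName(persons, "Jason").size());
    }
}
```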

As you can see in the above examples, ECL has been designed to be simple and practical. It enables data programmers to quickly turn their ideas into programs by keeping the syntax simple and minimal. Java is used in the examples to show how ECL's contrasting style can be used effectively to solve data processing problems. It does not mean that Java is not a practical language. It simply means that ECL is abstract enough to shield the programmer from complex language structures such as those in Java.

Tuesday, October 18, 2011

Parsing a Web Server Log File on Thor

One of the major benefits of Thor, the HPCC Systems data refinery component, is its ability to process large volumes of data, such as data from log files, without the need to perform pre-processing outside of the platform. In this blog entry you will be introduced to ECL's powerful regular expression library, which makes data extraction look easy. ECL is the programming language that is used to program ETL (Extract, Transform and Load) jobs on Thor.

Understanding the Input Content

Typical log file data can be classified as semi-structured content, because most log files have records and columns that can be extracted, as opposed to unstructured data, where record and column boundaries are hard to identify.

A web server log file is an example of a semi-structured file. A typical line in a web server log file that conforms to the standard common log format reads as:

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

ECL, the powerful distributed data programming language, has built-in libraries to process both semi-structured data (log files etc.) and unstructured data (emails, HTML content, Twitter feeds etc.). This enables you to maximize the parallel processing capability of the platform right out of the gate. No holding back.

Designing for Extraction

Let us identify the tokens that are present in a web log file. This will enable us to code a token parser to successfully extract web server log data from a file that implements the common log format:

Writing ECL Code

We will now write an ECL program to parse a file that contains lines of data where each line is formatted as shown above.

Step 1 - Declare record structure to store the raw input



//Declare the record to store each record from
//the raw input file. Since the file has lines of log data,
//the record will need one string field to store each line.
RawLayout := record
    string rawTxt;
end;

//Declare the file. In this example,
//for simplicity, the content is shown inline
fileRaw := dataset([{'127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'}], RawLayout);

Step 2 - Declare regular expression patterns to parse the tokens from each raw input record

Beware! This is where you would need to pick up a regular expressions book (or web site) if you are not already familiar with the topic.



pattern alphaFmt := pattern('[a-zA-Z]')+;
pattern alphaNumbFmt := pattern('[a-zA-Z0-9]')+;
pattern sepFmt := ' '+;
pattern numFmt := pattern('[0-9]')+;
pattern ipFmt := numFmt '.' numFmt '.' numFmt '.' numFmt;
pattern identFmt := '-';
pattern authuserFmt := alphaNumbFmt;
pattern hoursFromGMT := pattern('[\\-\\+]') numFmt;
pattern yearFmt := numFmt;
pattern monthFmt := alphaNumbFmt;
pattern dayFmt := numFmt;
pattern hoursFmt := numFmt;
pattern minutesFmt := numFmt;
pattern secondsFmt := numFmt;
pattern dateFmt := '[' dayFmt '/' monthFmt '/' yearFmt ':' hoursFmt ':' minutesFmt ':' secondsFmt ' ' hoursFromGMT ']';
pattern cmdFmt := alphaFmt;
pattern notQuoteFmt := pattern('[^"]')*;
pattern paramsFmt := opt('?' notQuoteFmt);
pattern urlFmt := pattern('[^"\\?]')*;
pattern httpMethodFmt := 'HTTP/' numFmt '.' numFmt;
pattern requestFmt := '"' cmdFmt urlFmt paramsFmt httpMethodFmt '"';
pattern statusFmt := numFmt;
pattern bytesFmt := numFmt;

pattern line := ipFmt sepFmt identFmt sepFmt authuserFmt sepFmt dateFmt sepFmt requestFmt sepFmt statusFmt sepFmt bytesFmt;

The declarations above are easy to follow if you know your regular expressions. These pattern declarations are used by the parser to extract tokens from the raw input record and map them to a (structured) model that can then be used to perform further processing.

The pattern line declaration specifies how each line in the file should be parsed and interpreted as tokens.

Step 3 - Declare the new record that will contain the extracted tokens



LogLayout := record
    string ip := matchtext(ipFmt);
    string authUser := matchtext(authuserFmt);
    string date := matchtext(dateFmt);
    string request := matchtext(requestFmt);
    string status := matchtext(statusFmt);
    string bytes := matchtext(bytesFmt);
end;

The matchtext() function is used to extract the specific token you are interested in from the parser.

Step 4 - Parse the file and output the result



logFile := parse(fileRaw, rawTxt, line, LogLayout, first);

output(logFile);

The parse function accepts the following parameters: the file to parse, the field in RawLayout to parse, the token format for each line, the output record layout and a flag. The flag value first indicates "Only return a row for the first match starting at a particular position".
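If you are more at home in Java, the same extraction can be approximated with java.util.regex. The sketch below is an assumption-laden translation of the ECL patterns above into one Java regular expression, not something generated by or shipped with HPCC Systems. The class name LogLineSketch and the method name parse are invented here, and the group names simply mirror the fields of LogLayout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical Java counterpart of the ECL token patterns.
class LogLineSketch {

    // One named group per field kept in LogLayout. The ident token ("-")
    // is matched but not captured, just as LogLayout does not store it.
    static final Pattern LINE = Pattern.compile(
            "(?<ip>\\d+\\.\\d+\\.\\d+\\.\\d+) +- +(?<authUser>[a-zA-Z0-9]+) +"
            + "(?<date>\\[\\d+/[a-zA-Z0-9]+/\\d+:\\d+:\\d+:\\d+ [-+]\\d+\\]) +"
            + "(?<request>\"[a-zA-Z]+[^\"?]*(?:\\?[^\"]*)?HTTP/\\d+\\.\\d+\") +"
            + "(?<status>\\d+) +(?<bytes>\\d+)");

    // Returns the extracted fields, or an empty Optional when the
    // pattern cannot be found in the line.
    static Optional<Map<String, String>> parse(String rawTxt) {
        Matcher m = LINE.matcher(rawTxt);
        if (!m.find()) {
            return Optional.empty();
        }
        Map<String, String> fields = new LinkedHashMap<>();
        for (String name : new String[] {"ip", "authUser", "date", "request", "status", "bytes"}) {
            fields.put(name, m.group(name));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        String rawTxt = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        System.out.println(parse(rawTxt));
    }
}
```

Note how the Java version squeezes everything into one hard-to-read expression, while the ECL version builds the line pattern out of small named patterns.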

Step 5 - Submit the program to Thor and view the results

Once the program is submitted (run), the output should look like:

There are a few variations of this program that you can implement.

Some Variations

What if the data source is actually a sprayed file?



fileRaw  := dataset('~.::myfile',RawLayout,csv(separator('')));

Simply replace the earlier fileRaw declaration with the one above. "~.::myfile" is the logical name of the sprayed file.

How can you record error lines that do not match the specified pattern, in other words, malformed input?

ErrorLayout := record
    string t := fileRaw.rawTxt;
end;

e := parse(fileRaw, rawTxt, line, ErrorLayout, NOT MATCHED ONLY);

output(e);

As you have seen, the ECL language and Thor provide you with a powerful framework to accomplish your ETL tasks. You can learn more about Thor and ECL at http://hpccsystems.com.

Wednesday, July 20, 2011

MapReduce Vs. HPCC Platform

Let us assume that we have to solve the following problem:

"Find the number of times each URL is accessed by parsing a log file containing URL and date stamp pairs"

Solution 1

This is easy. Traverse the log file, one line at a time. Record the count for every unique URL that is encountered. This is easily accomplished in a Java program using a for loop and a HashMap.
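A minimal sketch of this sequential approach is shown below. The exact layout of the log file is not specified here, so the code assumes one "url,date" pair per line. The class name UrlCountSketch and the sample lines are made up for the illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sequential solution; assumes each line is "url,date".
class UrlCountSketch {

    static Map<String, Integer> countUrls(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Everything before the first comma is taken to be the URL.
            String url = line.split(",", 2)[0].trim();
            // Add 1 to the running count, starting from 1 for a new URL.
            counts.merge(url, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "http://hpccsystems.com,2011-07-20",
                "http://hpccsystems.com,2011-07-21",
                "http://hpccsystems.com/developers,2011-07-21");
        System.out.println(countUrls(lines));
    }
}
```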

Solution 1 works great if the input file is small. What if you are dealing with a large volume of data, in the range of gigabytes, terabytes or petabytes? Sequential processing is not a practical option for dealing with Big Data.

Solution 2

In this solution, the input is split into multiple <key, value> pairs and fed to a map function in parallel. The map function then converts each of them into intermediate <key, value> pairs. In our example, the map function will output a <url, 1> pair for every input <key, value>, where url is the URL identified from the input and 1 stands for a single occurrence.

Before Map Step:

Key = 1 Value = http://hpccsystems.com
Key = 2 Value = http://hpccsystems.com
Key = 3 Value = http://hpccsystems.com/developers
Key = 4 Value = http://hpccsystems.com/downloads

After Map Step:

Key = http://hpccsystems.com Value = 1
Key = http://hpccsystems.com Value = 1
Key = http://hpccsystems.com/developers Value = 1
Key = http://hpccsystems.com/downloads Value = 1

The data is then sorted by the intermediate keys (<url1, 1>, <url2, 1> etc.) so that all occurrences of the same key are grouped together. Every unique key, together with all of its values, is then passed to a reduce function. The reduce function is called once for each unique key. In our example, the reduce function will simply count the occurrences of the unique url and output <url, total count>.

After Reduce Step:

Key = http://hpccsystems.com Value = 2
Key = http://hpccsystems.com/developers Value = 1
Key = http://hpccsystems.com/downloads Value = 1
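The map, sort and reduce steps above can be imitated in a single Java process to see how the pieces fit together. This is only a sketch of the idea: a real MapReduce framework runs the map and reduce calls on many worker nodes, and the class name MapReduceSketch and its method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the map, sort and reduce steps.
class MapReduceSketch {

    // Map step: turn one input value (a URL) into an intermediate <url, 1> pair.
    static Map.Entry<String, Integer> map(String url) {
        return Map.entry(url, 1);
    }

    // Sort step: group the intermediate values by key, with keys kept in sorted order.
    static TreeMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse all values of one key into a total count.
    static int reduce(String url, List<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Runs the three steps one after the other on a list of URLs.
    static Map<String, Integer> run(List<String> urls) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String url : urls) {
            pairs.add(map(url));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of(
                "http://hpccsystems.com",
                "http://hpccsystems.com",
                "http://hpccsystems.com/developers",
                "http://hpccsystems.com/downloads")));
    }
}
```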

This process of solving the problem by using the Map and Reduce steps is called MapReduce. The MapReduce paradigm was made famous by Google. Google used it to process large volumes of crawled data by spreading the map and reduce jobs across several worker nodes in a cluster and executing them in parallel to achieve high throughput. Another well-known MapReduce framework is Hadoop.

The major drawback of the MapReduce paradigm is the fact that it is intended to process batch-oriented jobs. It is suitable for ETL (Extraction, Transformation and Loading) but not for online query processing. So Hadoop and its extensions, Hive and HBase, have been built on top of a batch-oriented framework that is not really meant for online query processing. The other drawback is the fact that every task needs to be defined in terms of a Map and a Reduce step so that work can be distributed across nodes. In most cases it will take several Map and Reduce steps to solve a single problem.

Solution 3

SELECT url, count(*) FROM urllog GROUP BY url

How easy is that? No map or reduce logic. SQL, with its declarative nature, lets us concentrate on the What logic. However, SQL databases do not lend themselves to Big Data processing. Typical Big Data processing involves several clustered nodes processing information to produce an end result.

HPCC Systems ECL is specifically designed to overcome the limitations of SQL. ECL is a truly declarative language that is somewhat similar to SQL and lets you solve the problem by expressing the code as the What rather than the How. The complexity around clustering is well encapsulated by the ECL language and hence is never really exposed to the programmer.

Simple ECL Code to find the count for each unique URL:

//Declare the input record structure. Assume an input CSV file of URL,Date fields

rec := RECORD
STRING50 url;
STRING50 date;
END;

//Declare the source of the data
urllog := DATASET('~tutorial::AC::Urllog',rec,CSV);

//Declare the record structure to hold the aggregates for each unique URL
grouprec := record
urllog.url;
groupCount := COUNT(GROUP);
end;

//The TABLE function is equivalent to an SQL SELECT command
//The following declaration is used to create a Cross Tab aggregate of the SELECT equivalent shown above
RepTable := TABLE(urllog,grouprec,url);

//Output the new record set
OUTPUT(RepTable);

Sample Input:


Sample Output:

The HPCC Platform will distribute the work across the nodes based on the optimal path that is determined at runtime.

Compared to MapReduce frameworks like Hadoop, the HPCC Platform keeps it simple. Let the platform determine the best work distribution across nodes so that the developer solves the What rather than worrying about How the work is distributed across nodes. Further, the HPCC Platform comprises two unique components that are optimized to solve specific problems. Thor is used as an ETL (Extraction, Transformation, Loading and Linking) engine and Roxie is used as the online query processing engine.

Monday, July 11, 2011

ECL - Part III (Declarative, Attributes (aka Definitions) and Actions)

ECL is a declarative programming language. In declarative programming, we express the logic of computation and do not describe the control flow or the state. For example, in Java you would write:

int x = 1;
int y = 2;
System.out.println(x + y);

Java is an imperative programming language, where the sequence of steps you write dictates to the compiler in what order the code has to be executed. In the example, x = 1 is executed first, y = 2 next, and then System.out.println. Here, the programmer controls the sequence and state.

For the same code, in ECL you would write:

x:= 1; //An attribute declaration
y:= 2;
output(x+y); //An action

The code looks similar to the Java code. The difference is that the steps x := 1 and y := 2 neither perform a state assignment nor define the control flow. They simply mean: when there is a need to use x, substitute the value 1. Until there is such a need, do nothing.

Well, you might now ask the question - Is being a declarative programming language really important?

The answer is "YES". In parallel computing, it is best left to the platform to determine the optimal sequential (or parallel) steps to execute to reach an end goal. Performance and scale are everything.

Attributes and Actions

Attribute (aka Definition): A declaration such as x := 1; is an attribute declaration, where x is the attribute. It is important to note that an attribute is not equivalent to a variable state assignment, but is rather a declaration. The difference is that a declaration postpones the evaluation until the attribute is actually used.
Action: An action such as output(x+y); instructs the platform to execute a code snippet that is expected to produce a result.
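The idea of postponing evaluation until a value is used can be imitated in Java with java.util.function.Supplier. Treat this purely as an analogy: it says nothing about how the ECL compiler is implemented, and the class name DeferredSketch and the method name output are invented for the example.

```java
import java.util.function.Supplier;

// Hypothetical analogy for attributes (definitions) and actions.
class DeferredSketch {

    // Plays the role of the action: asking for the result is what
    // finally triggers the evaluation of x and y.
    static int output(Supplier<Integer> x, Supplier<Integer> y) {
        return x.get() + y.get();
    }

    public static void main(String[] args) {
        int[] evaluations = {0};

        // "Definitions": nothing is computed here, the lambdas are only remembered.
        Supplier<Integer> x = () -> { evaluations[0]++; return 1; };
        Supplier<Integer> y = () -> { evaluations[0]++; return 2; };
        Supplier<Integer> unused = () -> { evaluations[0]++; return 99; };

        System.out.println("evaluations before the action: " + evaluations[0]);
        System.out.println("result: " + output(x, y));
        // The definition named "unused" is never evaluated because no action needs it.
        System.out.println("evaluations after the action: " + evaluations[0]);
    }
}
```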

In ECL, every statement is either an attribute (aka definition) or an action. Declarative thinking helps Big Data developers focus on the solution to the problem (the What), rather than on the sequence of steps, parallel programming techniques and state assignment (the How). Being declarative is another reason why ECL is a powerful language for Big Data programming.