baijia - papers and notes

Full Version: Yu...DryadLINQ: A system for general-purpose distributed data-parallel...OSDI'08
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), December 8-10 2008

* * *

How the terms relate:

LINQ: a set of extensions to the .NET Framework that encompass language-integrated query, set, and transform operations. It extends C# and Visual Basic to support queries and provides class libraries to take advantage of these capabilities.
A link: http://msdn.microsoft.com/en-us/library/bb308959.aspx

Dryad: a distributed execution engine for data-parallel applications. Its role is broadly similar to that of MapReduce.
http://baijia.info/showthread.php?tid=203

DryadLINQ: a framework that generates Dryad computations from the LINQ (Language-Integrated Query) extensions to C#.

A DryadLINQ program uses LINQ expressions to perform side-effect-free transformations on datasets, and can be written and debugged with standard .NET development tools. This makes DryadLINQ convenient to use: it automatically and transparently translates the data-parallel portions of the program into a distributed execution plan, which is passed to the Dryad execution platform.

The basic idea is that DryadLINQ runs in the .NET environment. The user first writes LINQ expressions that specify the computation, and DryadLINQ compiles those expressions into a distributed Dryad execution plan that can be executed directly on Dryad. When Dryad finishes running the job, it writes the results to the output tables. DryadLINQ then creates local DryadTable objects that encapsulate the results, so the user can read the contents of the DryadTables as .NET objects. From the user's point of view, there is no need to deal with the complicated details of scheduling, distribution, and fault tolerance, which makes the system much friendlier to use.
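
To make the "write a query, compile a plan, run it, read back results" flow a bit more concrete, here is a minimal sketch in Python. It is only an analogy and not the DryadLINQ API: the `Query` class, `from_partitions`, `plan`, and `run` are hypothetical names made up for illustration, and the "cluster" is just a list of in-memory partitions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class Query:
    """A deferred query: it records operators instead of running them."""
    partitions: Tuple[Tuple[Any, ...], ...]
    ops: Tuple[Tuple[str, Callable], ...] = ()

    def where(self, pred):
        # Side-effect-free: returns a new Query, the old one is untouched.
        return Query(self.partitions, self.ops + (("where", pred),))

    def select(self, fn):
        return Query(self.partitions, self.ops + (("select", fn),))

    def plan(self):
        """'Compile' step (hypothetical): list the recorded operators in order."""
        return [name for name, _ in self.ops]

    def run(self):
        """'Execute' step: apply the whole plan to each partition independently."""
        results = []
        for part in self.partitions:
            rows = list(part)
            for name, fn in self.ops:
                if name == "where":
                    rows = [r for r in rows if fn(r)]
                else:  # "select"
                    rows = [fn(r) for r in rows]
            results.append(rows)
        return results


def from_partitions(parts):
    """Hypothetical stand-in for opening a partitioned input table."""
    return Query(tuple(tuple(p) for p in parts))


# Nothing is computed until run() is called.
q = (from_partitions([[1, 2, 3], [4, 5, 6]])
     .where(lambda x: x % 2 == 0)
     .select(lambda x: x * 10))
output_tables = q.run()
```

The point of the sketch is only that the user-facing part is the chained query; partitioning and execution happen behind `run()`.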

The experiments in the evaluation section of the paper are representative. In particular, they compare DryadLINQ with Dryad and show that DryadLINQ's performance is only slightly worse than Dryad's, while it offers users a much better programming interface.

Hence, DryadLINQ adds value to Dryad. Although Dryad is flexible, arranging its graph vertices is beyond casual users, and Dryad is more complex than MapReduce. DryadLINQ compensates for this disadvantage of Dryad.

(laxab, lingu)
Liu,Peng (lpxz3141) will present this one.
(08-05-2009 11:23 PM) lingu wrote: Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th Symposium on Operating Systems Design and Implementation (OSDI), December 8-10 2008
To clarify some definitions first (I was fuzzy about them myself):

1. SQL is a standard language for accessing and manipulating databases.

2. LINQ is a set of extensions to the .NET Framework that encompass language-integrated query, set, and transform operations. It extends C# and Visual Basic with native language syntax for queries and provides class libraries to take advantage of these capabilities.
A link: http://msdn.microsoft.com/en-us/library/bb308959.aspx

3. Dryad: a distributed execution engine for data-parallel applications, like MapReduce.
A simple introduction:
http://research.microsoft.com/en-us/projects/Dryad/
A paper providing details:
http://research.microsoft.com/en-us/proj...http://research.microsoft.com/en-us/projects/dryad/eur

4. DryadLINQ generates Dryad computations from the LINQ Language-Integrated Query extensions to C#

----------------------------------------------------------------
That's a very useful clarification.

The Dryad paper's baijia entry is http://baijia.info/showthread.php?tid=20...http://baijia.info/showthread.php?tid=203&highl
Dryad
The fundamental difference between Dryad and MapReduce is that a Dryad application may specify an arbitrary communication DAG rather than being required to follow a sequence of map/distribute/sort/reduce operations. In particular, graph vertices may consume multiple inputs and generate multiple outputs, of different types. For many applications this simplifies the mapping from algorithm to implementation. Moreover, Dryad builds on a larger library of basic subroutines and can use TCP pipes and shared memory for data edges, which can bring substantial performance gains. Basically, it is more general than MapReduce.
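
As a rough illustration of what "an arbitrary DAG whose vertices take multiple inputs" means, here is a small Python sketch. It is not Dryad's API: `run_dag` and the vertex names are hypothetical, everything runs in one process, and the standard-library `graphlib.TopologicalSorter` (Python 3.9+) stands in for the scheduler.

```python
from graphlib import TopologicalSorter


def run_dag(vertices, edges, sources):
    """Run a tiny dataflow graph in dependency order.

    vertices: name -> function taking one argument per upstream vertex
    edges:    name -> list of upstream names (order = argument order)
    sources:  name -> input value for vertices that have no function
    """
    results = dict(sources)
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(edges).static_order():
        if name in results:
            continue  # source data, nothing to compute
        inputs = [results[up] for up in edges.get(name, [])]
        results[name] = vertices[name](*inputs)
    return results


# "filter" consumes two inputs of different types (a list and a set);
# its output fans out to two vertices that produce an int and a str.
vertices = {
    "filter": lambda words, stop: [w for w in words if w not in stop],
    "count": lambda kept: len(kept),
    "longest": lambda kept: max(kept, key=len),
}
edges = {
    "filter": ["words", "stop"],
    "count": ["filter"],
    "longest": ["filter"],
}
sources = {"words": ["a", "dryad", "the", "linq"], "stop": {"a", "the"}}

results = run_dag(vertices, edges, sources)
```

A fixed map/sort/reduce pipeline cannot express the two-input "filter" vertex or the fan-out directly, which is the generality the post is pointing at.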

DryadLINQ
A DryadLINQ program uses LINQ expressions to perform arbitrary side-effect-free transformations on datasets, and can be written and debugged with standard .NET development tools. This makes DryadLINQ convenient to use: it automatically and transparently translates the data-parallel portions of the program into a distributed execution plan, which is passed to the Dryad execution platform.

In fact, the programming interfaces leave room for improvement not only in Dryad but also in MapReduce. For example, MapReduce computations must be embedded in a scripting language in order to run programs that need more than one reduction or sorting stage. Systems such as Pig, Sawzall, and Facebook's Hive were developed to address this problem.
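
A toy Python sketch of the "more than one reduction stage" problem: each stage is a complete map/reduce pass, and the outer script is the glue that chains them. The `map_reduce` helper here is a hypothetical single-process stand-in, not any real MapReduce API.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """One map/shuffle/reduce pass, run in a single process."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)  # "shuffle": group values by key
    # Reduce each group; keys are visited in sorted order.
    return [reducer(key, values) for key, values in sorted(groups.items())]


lines = ["a b a", "b a c"]

# Stage 1: word count.
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)

# Stage 2: group words by their count. This needs a second full pass,
# and the driver script has to feed stage 1's output into it by hand.
by_count = map_reduce(
    counts,
    mapper=lambda wc: [(wc[1], wc[0])],
    reducer=lambda n, words: (n, sorted(words)),
)
```

In a real deployment the hand-off between the two stages is what ends up living in a scripting layer, which is the awkwardness the higher-level systems try to hide.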

The basic idea is that DryadLINQ runs in the .NET environment. The user first writes LINQ expressions that specify the computation, and DryadLINQ compiles those expressions into a distributed Dryad execution plan that can be executed directly on Dryad. When Dryad finishes running the job, it writes the results into the output tables. DryadLINQ then creates local DryadTable objects that encapsulate the results, so the user can read the contents of the DryadTables as .NET objects. From the user's point of view, there is no need to deal with the complicated details of scheduling, distribution, and fault tolerance, which makes the system much friendlier to use.

The experiments in the evaluation section of the paper are representative. In particular, they compare DryadLINQ with Dryad and show that the performance of DryadLINQ is only slightly worse than that of Dryad, while it provides a much better programming interface to users.

So, basically, I think DryadLINQ is the system that adds value to Dryad. Although Dryad is flexible, most ordinary users cannot handle the arrangement of its graph vertices, and Dryad is more complex than MapReduce. DryadLINQ compensates for this disadvantage of Dryad.
The main difference between DryadLINQ and other technologies such as MapReduce, general-purpose programming on GPUs, and parallel databases is that DryadLINQ compiles user programs into Dryad jobs and runs them on clusters under the hood, largely invisibly to the user, so the user does not need to handle the job graph. DryadLINQ also implements some optimization techniques: for example, it executes multiple operators on the same machine and removes unnecessary partitioning steps.
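
The two optimizations mentioned here can be sketched as a small plan rewrite in Python. This is an illustration under my own assumptions, not DryadLINQ's actual optimizer: the plan format, the `fuse` function, and the rule that a repartition immediately followed by another repartition is redundant are all made up for the example.

```python
def fuse(plan):
    """Rewrite a linear plan into fewer stages.

    plan: list of (kind, fn) with kind in {"select", "where", "repartition"}.
    Returns a list of stages, each either
      ("pipeline", [ops...])  - per-record operators run together on one machine
      ("repartition", fn)     - a data-shuffling step
    """
    stages = []
    for kind, fn in plan:
        if kind in ("select", "where"):
            # Pipelining: append to the current pipeline stage if there is one.
            if stages and stages[-1][0] == "pipeline":
                stages[-1][1].append((kind, fn))
            else:
                stages.append(("pipeline", [(kind, fn)]))
        elif kind == "repartition":
            # Assumed rule: back-to-back repartitions collapse to the last one.
            if stages and stages[-1][0] == "repartition":
                stages[-1] = ("repartition", fn)
            else:
                stages.append(("repartition", fn))
        else:
            raise ValueError(f"unknown operator kind: {kind}")
    return stages


def double(x):
    return x * 2


def is_positive(x):
    return x > 0


def by_parity(x):
    return x % 2


def by_last_digit(x):
    return x % 10


plan = [
    ("select", double),
    ("where", is_positive),
    ("repartition", by_parity),
    ("repartition", by_last_digit),
    ("select", double),
]
stages = fuse(plan)
```

Five operators become three stages: the first two run as one pipeline, and only the last of the two repartition steps survives.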

But there are some disadvantages in the design and implementation of DryadLINQ. For example, only one job manager runs at a time, so the job manager may become a bottleneck, and the paper does not mention whether DryadLINQ supports multiple job managers.
@inproceedings{yu08dryadlinq,
author = {Yuan Yu and
Michael Isard and
Dennis Fetterly and
Mihai Budiu and
{\'U}lfar Erlingsson and
Pradeep Kumar Gunda and
Jon Currey},
title = {DryadLINQ: A System for General-Purpose Distributed Data-Parallel
Computing Using a High-Level Language},
booktitle = {OSDI},
year = {2008},
pages = {1-14},
ee = {http://www.usenix.org/events/osdi08/tech/full_papers/yu_y/yu_y.pdf},
}