Dean, J. and Ghemawat, S. 2004. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6 (San Francisco, CA, December 06 - 08, 2004).
* * * * *
MapReduce is a parallelization framework for data-centric parallelization. It fulfills a computation task in three steps: map, shuffle, and reduce. The programmer, however, only needs to write explicit code for the map and reduce steps.
The input of a MapReduce program is a collection of <key, value> tuples. Essentially, map transforms the input tuples into intermediate <key', value'>.
Logically, shuffle waits for all intermediate tuples to be ready, then sorts and groups the intermediate tuples to <key', list(value')>.
The reduce step iterates through list(value') corresponding to a key, and performs computation on the list.
Both map and reduce can be parallelized, assuming the data items are independent of each other. The shuffle step, I think, can partially be parallelized.
PDF from google:
http://labs.google.com/papers/mapreduce-osdi04.pdf
Videos on cluster computing and MapReduce from Google:
http://code.google.com/edu/submissions/m...http://code.google.com/edu/submissions/mapreduce-minilecture/li
@inproceedings{dean04mapreduce,
author = {Dean, Jeffrey and Ghemawat, Sanjay},
title = {MapReduce: simplified data processing on large clusters},
booktitle = {Proc. of the 6th conference on Symposium on Opearting Systems Design \& Implementation},
year = {2004},
pages = {10--10},
location = {San Francisco, CA},
}
In my personal view I really want to classify MapReduce as an industrial realization of the traditional parallel techniques. In this way, as a industrial product, we may not consider it to have noble features like the ones in research fields. Instead, how well does it support the industrial usage (eg. The search engine service of GOOGLE Inc.) become a key point. Since it works well on searching works, we may give a good mark to it.
The following are my comments to the article "MapReduce: A major step backwards":
Let's review the 5 points mentioned in the article:
1. A giant step backward in the programming paradigm for large-scale data intensive applications
2. A sub-optimal implementation, in that it uses brute force instead of indexing
3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago
4. Missing most of the features that are routinely included in current DBMS
5. Incompatible with all of the tools DBMS users have come to depend on
In fact, for the first 3 points, I find they are not really reasonable. They think the recent things must be better than the old things. Like indexing verses brute force. The belief of mapReduce uses parallel processing techniques that gives each machine a small task to work on, then the small piece of work can be simply using brute force and direct approach to complete the task. Hence, it is not wise to compare indexing - the skill to speed up accessing for frequency uses of data, with mapReduce that completes a specific task for a time. Of course mapReduce can be used for support queries, that's another case to be mentioned in point 4.
I think the 4th point is the most reasonable one because mapReduce only considers the completion of a single task, rather than keeping a good and stable environment for developers to build up a system. For example, mapReduce cannot support updating, dependent queries, views, tables and most importantly, consistency checking which makes mapReduce difficult to replace DBMS like Oracle.
The 5th points is a matter of time. The is no restrictions that we cannot build any tools to connect the inputs and outputs of mapReduce to applications to speed up some tasks. Perhaps DBMS system might consider let some queries to be done by multiple machines to improve the performance.
(09-25-2009 01:05 AM)tonyho Wrote: [ -> ]I think the 4th point is the most reasonable one because mapReduce only considers the completion of a single task, rather than keeping a good and stable environment for developers to build up a system. For example, mapReduce cannot support updating, dependent queries, views, tables and most importantly, consistency checking which makes mapReduce difficult to replace DBMS like Oracle.
Yes that's a valid point. These features (e.g., stronger consistency checking) are good to have. While questioning the completeness of MapReduce, we need to also investigate (or speculate, at least) the following items.
1. Why are the features missing in MapReduce
2. Are the feature just missing in the early version of MapReduce? (Has Google added some important features to MapReduce after publishing the MapReduce paper?)
3. If we want to improve, how can we add one or two of these features?
(09-22-2009 09:51 PM)szh Wrote: [ -> ]In my personal view I really want to classify MapReduce as an industrial realization of the traditional parallel techniques. In this way, as a industrial product, we may not consider it to have noble features like the ones in research fields. Instead, how well does it support the industrial usage (eg. The search engine service of GOOGLE Inc.) become a key point. Since it works well on searching works, we may give a good mark to it.
Can you name one specific traditional parallel techniques that MapReduce realizes? It would be interesting to compare them.
In my opinion, MapReduce is designed to handle this kinds of problems: The problem and the data to process is large-scale and the input data is already on the net file system such as GFS. But in some situation, such as distributed compilation, we may need to run a very large number of small of small jobs with small input data for each job. But unfortunately, MapReduce is not very good at the seconds kind of problem compared to the previous one.
The MapReduce is kind of "one master" architecture. I think the work of the master server may be distributed to the servers. Such as number of masters and number of mappers and numbers of reducers. One master and several mappers and reducers are responsible for one small job. If there are many small jobs, there are many master-mapper-reducer groups. These small groups may complete one job with small latency. It can be considered that many MapReduces inside of one large cluster. And these smaller MapReduces may use dynamically number of servers. If the job is large, the groups is large. If the job is small, then we can assign smaller number of servers to this group.
MapReduce on Amazon's cloud service.
According to the project management, "Cloud MapReduce was initially developed at Accenture Technology Labs. It is a MapReduce implementation on top of the Amazon Cloud OS.
Cloud MapReduce has minimal risk w.r.t. the MapReduce patent, compared to other open source implementations, as it is implemented in a completely different architecture than described in the Google paper."
Project web site:
http://code.google.com/p/cloudmapreduce/