baijia - papers and notes

Full Version: Hunt:ZooKeeper:wait-free coordination for Internet-scale systems.USENIX10
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Patrick Hunt, Mahadev Konar, Flavio P. Junqueira, and Benjamin Ree. ZooKeeper: Wait-free Coordination for Internet-Scale Systems. USENIX'10 (Best Paper Award)

PDF from USENIX: http://www.usenix.org/events/atc10/tech/...http://www.usenix.org/events/atc10/tech/full_paper
Notes from Zhonghua

This paper proposed a service used by Yahoo! for coorinating processes of distributed applications. We may see this service as an enhanced file system with key/value based storage ability and per client guarantee of FIFO execution. This service is "wait-free" because that each client stores a replica of the informations and updates them with "push" method. ("watch" in the paper. On every changes of a znode(value to a key), the message is pushed to the clients). When writing, the write request is sent and executed in sequence. (So there maybe some client's value not updated on time. This may cause the value obtained from these clients invailidate. The author solved this by adding a "sync" read which require all writes before it being applied. This is something like the "flush" operations but may lead to a problem of "non-wait-free".)

1, When the system scales up, there may be some chances that one znode becomes a hot spot and quite a lot of clients are watching it. In this circumstance, sending messages to all the clients whenever the znode is updated may introduce too much overhead.
2, If some clients are expecting the value of some frequently changed znode to become some value. Saying that, if the znode represents some events. Some clients want to execute some code when event 5 happens, and some are waiting for event 19, etc. Then the event code should be broadcasted to all the clients no matter whether it actually need this event. Then all the nodes should come to read the value and set the watch again. If we may add an API of event-driven watch that if the value of a znode changed to an expected value, the client will revceive an notification. Should it be better?
3, In section 4.3, the author mentioned the snapshot for recovering the states after a crash. The fuzzy snapshots is a good way. But the author claimed that "since state changes are idempotent, we can apply them twice as long as we apply the state changes in order", may it work for "sequential" znodes? If we apply a updating on sequential znodes twice, the the znode should be different because it should "have the value of a monotonically increasing counter appended to its name" (as in Section 2.1)
4, In section 2.2, "Each request instead includes the full path of the znode being operated on. Not only does this choice simplifies the APIs, but it also eliminates extra state that the server would need to maintain". This is quite similar to the stateless design in FS.
2, Develop a basic service and let the user develop there own primitives is a good choice. But how shall we justified it is enough in to present other primitives? Or enough to present some kinds of primitives.

Referenece 6 of the paper:
M.Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI'06.
http://baijia.info/showthread.php?tid=59...http://baijia.info/showthread.php?tid=59&highli
Good review and I like some of the technical discussions. Beyond that, maybe we can also think of what we should do if we want to construct a stronger solution with easier-to-use guarantees. This also relates to the understanding of how ZooKeeper is exactly implemented. For example, why is it required that full path be given for a znode. If we want to remove this limitation, what complication would manifest and would it be possible to tackle the problem?

(11-18-2010 10:03 AM)szh Wrote: [ -> ]Notes from Zhonghua

This paper proposed a service used by Yahoo! for coorinating processes of distributed applications. We may see this service as an enhanced file system with key/value based storage ability and per client guarantee of FIFO execution. This service is "wait-free" because that each client stores a replica of the informations and updates them with "push" method. ("watch" in the paper. On every changes of a znode(value to a key), the message is pushed to the clients). When writing, the write request is sent and executed in sequence. (So there maybe some client's value not updated on time. This may cause the value obtained from these clients invailidate. The author solved this by adding a "sync" read which require all writes before it being applied. This is something like the "flush" operations but may lead to a problem of "non-wait-free".)

1, When the system scales up, there may be some chances that one znode becomes a hot spot and quite a lot of clients are watching it. In this circumstance, sending messages to all the clients whenever the znode is updated may introduce too much overhead.
2, If some clients are expecting the value of some frequently changed znode to become some value. Saying that, if the znode represents some events. Some clients want to execute some code when event 5 happens, and some are waiting for event 19, etc. Then the event code should be broadcasted to all the clients no matter whether it actually need this event. Then all the nodes should come to read the value and set the watch again. If we may add an API of event-driven watch that if the value of a znode changed to an expected value, the client will revceive an notification. Should it be better?
3, In section 4.3, the author mentioned the snapshot for recovering the states after a crash. The fuzzy snapshots is a good way. But the author claimed that "since state changes are idempotent, we can apply them twice as long as we apply the state changes in order", may it work for "sequential" znodes? If we apply a updating on sequential znodes twice, the the znode should be different because it should "have the value of a monotonically increasing counter appended to its name" (as in Section 2.1)
4, In section 2.2, "Each request instead includes the full path of the znode being operated on. Not only does this choice simplifies the APIs, but it also eliminates extra state that the server would need to maintain". This is quite similar to the stateless design in FS.
2, Develop a basic service and let the user develop there own primitives is a good choice. But how shall we justified it is enough in to present other primitives? Or enough to present some kinds of primitives.

Referenece 6 of the paper:
M.Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI'06.
http://baijia.info/showthread.php?tid=59...http://baijia.info/showthread.php?tid=59&highli
ZooKeeper is a kernel for building coordination primitives for processes
of distributed applications. Instead of building service for specific
application, ZooKeep provides a simple and high performance kernel for
application to implement primitives. Instead of using blocking
primitives such as locks, ZooKeeper uses wait-free data object to avoid
faulty clients slowing down the overall system.

ZooKeeper organize the data objects in a hierachical way as in file
systems and the hierachical relationship between objects is also used to
implement primitives. A hierachical architecture like the UNIX file
system can provides a friendly abstract to the human users. But it may
not so important for applications since the users of the service is
computer program rather than human. On the other hand, in the
hierachical tree, the system need to travel along the path from the root
to the leaf of the tree the get the znode. When the system scales large,
the cost of travel along the path may cost much. However, a "key-value
list" pair system without adding much more mechanism than ZooKeeper can
implement most of the functions of ZooKeeper:

A distributed "key-value list" pair store with the similar watcher,
distributed linearisable writes, FIFO client order, sequence value
appendix and other features like ZooKeeper. Except write one value of
one key, we add value to the current value list of the key, and delete
value from the value list of the key.

For the primitives in the paper that do not requires the help of the
hierarchical relationship, such as leader election and configuration
management, the "key-value list" pair store can act as the same as
ZooKeeper. And actually, only one value for one key is used.

For the primitives in the paper that uses the hierarchical relationship,
such as group membership and locks, we use the key - value list
relationship to implement the two level hierarchy as in ZooKeeper.
Although this limit the hierarchy level to two, we find the most of the
primitives (at least the one in the paper) do not use complex hierarchy
more than two level.

Using the "key-value list" pair can help to design a more scalable
system and the pairs can be easier to be partitioned into distributed
nodes than a tree.
(02-15-2011 12:25 AM)zma Wrote: [ -> ]For the primitives in the paper that do not requires the help of the
hierarchical relationship, such as leader election and configuration
management, the "key-value list" pair store can act as the same as
ZooKeeper. And actually, only one value for one key is used.

For the primitives in the paper that uses the hierarchical relationship,
such as group membership and locks, we use the key - value list
relationship to implement the two level hierarchy as in ZooKeeper.
Although this limit the hierarchy level to two, we find the most of the
primitives (at least the one in the paper) do not use complex hierarchy
more than two level.

Using the "key-value list" pair can help to design a more scalable
system and the pairs can be easier to be partitioned into distributed
nodes than a tree.

The review is good. You can reduce the introduction part and make it shorter if you like.

I am not sure of the equivalence of key-value transactions and the hierarchical transactions. Let me re-visit the paper and get back to you later.
Reference URL's