Efficient synchronization algorithm


Let's say I have a large dataset at node_a (~10 MB, ~650k lines). There is also a copy of the dataset at node_b, but neither copy is the master version, which means there may be pieces on one node that are not available on the other. My goal is to synchronize the contents of node_b with the contents of node_a. What is the most efficient way to do this?

The common-sense solution would be:

node_a: here is everything I have ... (sends the entire dataset)

node_b: and here is what you don't have ... (sends back the missing parts)

But this solution is far from perfect: every synchronization attempt requires node_a to send the full ~10 MB.
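For concreteness, here is a minimal sketch of that naive exchange. It assumes the dataset can be treated as a set of lines and ignores the actual transport; both are my simplifying assumptions, not part of the setup above.

```python
def naive_sync(node_a_lines: set[str], node_b_lines: set[str]) -> None:
    """Naive bidirectional sync: node_a ships everything, node_b replies with the rest."""
    # node_a sends the entire dataset (~10 MB every single time)
    received_from_a = set(node_a_lines)

    # node_b merges in everything it was missing
    node_b_lines |= received_from_a

    # node_b replies with the parts node_a does not have
    missing_on_a = node_b_lines - received_from_a
    node_a_lines |= missing_on_a
```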

Being a bit smarter about it, I could partition the dataset and send only the partitions that differ (see the sketch below).
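Here is a rough sketch of that partition idea, assuming fixed-size chunks of lines and a per-chunk digest so node_a only has to ship the digests plus the chunks that actually differ. The chunk size and hashing scheme are illustrative assumptions, not something fixed by the problem.

```python
import hashlib

CHUNK_LINES = 1000  # assumed partition size


def chunk(lines: list[str]) -> list[list[str]]:
    """Split the dataset into fixed-size partitions of lines."""
    return [lines[i:i + CHUNK_LINES] for i in range(0, len(lines), CHUNK_LINES)]


def digest(chunk_lines: list[str]) -> str:
    """Compute a digest for one partition."""
    h = hashlib.sha256()
    for line in chunk_lines:
        h.update(line.encode("utf-8"))
    return h.hexdigest()


def sync_b_from_a(a_lines: list[str], b_lines: list[str]) -> list[str]:
    """Return node_b's new contents, transferring only the partitions that differ."""
    a_chunks, b_chunks = chunk(a_lines), chunk(b_lines)

    # Step 1: node_a sends only the per-chunk digests (a few KB instead of ~10 MB).
    a_digests = [digest(c) for c in a_chunks]

    # Step 2: node_b keeps the chunks whose digests match and requests the rest.
    new_b: list[str] = []
    for i, d in enumerate(a_digests):
        if i < len(b_chunks) and digest(b_chunks[i]) == d:
            new_b.extend(b_chunks[i])   # identical chunk, keep the local copy
        else:
            new_b.extend(a_chunks[i])   # transfer only this differing chunk
    return new_b
```

One obvious weakness of this sketch is that fixed-position chunking misaligns as soon as a line is inserted in the middle, so every later chunk appears to differ.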

Can you think of any better solution?