Interesting software design question on interview in Facebook.
The question: You need to make a scalable distributed website parsing system consisting of identical nodes.
I did not find on the Internet any intelligible answer to this question, and confirmation of what answer is considered an acceptable for Facebook.
Therefore, I decided to write my own version of the architecture, which I consider the best. I am pleased to read about other solutions that will be better than this.
The basis of the system should be a network based on DHT. It will be responsible for the connectivity of the network of nodes, the reliability of data transfer, and most importantly for the quick determination of which files were parsed and which were not yet. Our distributed hash table should store information about the state of files that are parsed, and full links to them.
The second subsystem, which should be in each node, is a parser. The parser takes files from the local queue and parses them. Files are added to the queue by messages from the network. If the parser finds in the processed file links to new files that should be parsed, it sends a message about this to the network. This message adds the file to the queue of the corresponding node.
How it works.
1. Start the nodes.
2. Each node gets info about other nodes, does a DHT bootstrap procedure and they all interconnecting into a network.
3. A random node of of the network gets the command “add the website root file to your local parsing queue”.
4. The parser processes the file and finds links to other files there. For each such link, it calculates the hash and sends the hash and the link itself to the DHT network.
5. According to the principle of operation of DHT, this message will travel through the DHT network to a node whose ID is most similar to this hash.
6. When the message reaches the destination, it will be added to its parsing queue, and the parsing results will be stored in this node.
7. If new links are found during the parsing process, they will be sent to the DHT network and the whole procedure will be repeated starting from step 4.
Parsing will end when the queues of all nodes are empty. To add a new file to the network for parsing and to find the parsing results of any file, you need to send a message to no more than Log N nodes, where N is the number of nodes in the DHT network.
Network redundancy and reliability are achieved in the usual ways for DHT. To understand the principle of parser architecture, it makes no sense to describe them here. As everyone who has ever used torrents, iDonkey, i2p or similar DHT networks knows, they are known as phenomenal reliable and stable.