Distributed TensorFlow and the hidden layers of engineering work

Google published a pair of solution tutorials to show you how you can create and run a distributed TensorFlow cluster on Google Compute Engine and run the same code to train the same model on Google Cloud Machine Learning Engine. The solutions use MNIST as the model, which isn’t necessarily the most exciting example to work with, but does allow us to emphasize the engineering aspects of the solutions.

Google already talked about the open-source nature of TensorFlow, allowing you to run it on your laptop, on a server in your private data center, or even a Raspberry PI. TensorFlow can also run in a distributed cluster, allowing you divide your training workloads across multiple machines, which can save you a significant amount of time waiting for results. The first solution shows you how to set up a group of Compute Engine instances running TensorFlow, as in Figure 1, by creating a reusable custom image, and executing an initiation script with Cloud Shell. There are quite a few steps involved in creating the environment and getting it to function properly. Even though they aren’t complex steps, they are operational engineering steps, and will take time away from your actual ML development.

Read More