ML research platform
I built an ML research tool to monitor training experiments, compare models, view real-time training visualizations, distribute jobs across workers, and run edge inference on a Raspberry Pi. The goal was to mirror the infrastructure behind machine learning research and learn from it.
Coming into this I was (and partly still am) a complete noob. Tesla FSD is impressive to me, so I built this ML pipeline to understand exactly how that kind of system works at scale. Each vehicle runs inference on an ML model that is sent to it over the air (OTA) and reports telemetry and sensor data back to the company, in real time (I imagine). To replicate this, I bought a Raspberry Pi 5 that acts as my “vehicle,” where I run inference with the model trained on my computer. I also bought an MPU sensor that reports live data back to my computer, where it is displayed on a Next.js dashboard in real time.
The system is powered by a Flask server. A researcher uses curl to send research parameters (called a job) to a Flask API endpoint, which adds the job to the current queue. Redis workers take jobs off the queue in FIFO order and start working on them. As they work, they report data back to the Flask server, which saves it to a PostgreSQL database and then emits it over a WebSocket. The dashboard listens on the same WebSocket for those emissions. Every time the Flask server emits data, the dashboard “hears” it and displays it.
Imagine two people in this scenario: person A shouts out whatever info he hears, and person B, who’s sitting nearby, hears what person A is shouting and writes it down. In this case, Flask is person A shouting the data into the void, and the dashboard is person B displaying whatever it hears.
(I later swapped out the Redis workers for Modal because the workers began throttling my computer.)
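The flow described above (queue a job, have a worker pick it up, save each update, then broadcast it to whoever is listening) can be sketched with the Python standard library alone. This is only an illustration under loud assumptions: `queue.Queue` stands in for the Redis queue, a plain list stands in for the PostgreSQL table, and a list of callbacks stands in for the WebSocket clients. The function name `run_jobs` and the job fields `name` and `epochs` are made up for this example and are not taken from the project.

```python
import queue
import threading


def run_jobs(jobs):
    """Push jobs through a FIFO queue and return (saved_rows, dashboard_rows)."""
    job_queue = queue.Queue()            # FIFO: first job in is the first job out
    saved_rows = []                      # stand-in for the PostgreSQL table
    dashboard_rows = []                  # everything the dashboard has "heard"
    listeners = [dashboard_rows.append]  # stand-in for connected WebSocket clients

    def report(metric):
        # Server side: save the update first, then broadcast it to every listener.
        saved_rows.append(metric)
        for listener in listeners:
            listener(metric)

    def worker():
        # Worker side: take jobs off the queue in order and report progress.
        while True:
            job = job_queue.get()
            if job is None:              # sentinel value: no more jobs
                break
            for epoch in range(1, job["epochs"] + 1):
                report({"job": job["name"], "epoch": epoch})

    for job in jobs:
        job_queue.put(job)
    job_queue.put(None)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return saved_rows, dashboard_rows


saved, heard = run_jobs([
    {"name": "job-a", "epochs": 2},
    {"name": "job-b", "epochs": 1},
])
for row in heard:
    print(row)
```

The server never checks who is listening; it saves, emits, and moves on, which is the “person A shouting” half of the analogy. The dashboard only appends whatever arrives.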
The work the workers do in this pipeline is what is referred to as “AI training,” on a much smaller and abstracted scale.
Training the AI is one thing, but we also need to test it. In training, we show the AI data and tell it what that data means. In testing, we show it data and ask it to tell us what that data means, to confirm it really did learn during training. Most datasets, like the one I used for this project, come with both training and test sets.
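To make the train/test distinction concrete, here is a toy sketch. It is not the model or dataset from this project: the “model” is just one average value per label, and the data points are invented for the example. What matters is that `train` sees the labels, while `accuracy` only uses them to check the model’s answers on data it has not seen.

```python
def train(examples):
    """Learn one average value per label from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in examples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(model, value):
    """Answer with the label whose learned average is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))


def accuracy(model, examples):
    """Fraction of held-out examples the model labels correctly."""
    correct = sum(predict(model, value) == label for value, label in examples)
    return correct / len(examples)


# Invented toy data: the training set teaches, the test set checks.
train_set = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
test_set = [(1.5, "low"), (8.5, "high"), (3.0, "low")]

model = train(train_set)
print("learned averages:", model)
print("test accuracy:", accuracy(model, test_set))
```

If the model only memorized the training points, it would have nothing to say about 3.0; scoring it on the test set is how we find out whether it generalized.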
To test the model, we save it to our computer after training, transfer the saved model to the Raspberry Pi, and run it there. As it runs, it relays data back to the Flask server, which gets displayed on the dashboard to let us know whether the model really did learn.
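The save, transfer, and run steps can be sketched like this. Everything here is an assumption made for illustration: the model is a small dictionary saved as JSON (a real project would use its ML framework’s own save format), the “transfer” is just reading the file back from a temporary directory rather than copying it to the Pi, and the readings are invented numbers rather than real sensor data.

```python
import json
import os
import tempfile


def save_model(model, path):
    """Write the trained model to disk (on the training computer)."""
    with open(path, "w") as f:
        json.dump(model, f)


def load_model(path):
    """Read the model back from disk (on the device that runs inference)."""
    with open(path) as f:
        return json.load(f)


def predict(model, value):
    """Answer with the label whose learned average is closest to the value."""
    return min(model, key=lambda label: abs(model[label] - value))


def run_inference(model, readings):
    """Build one report per reading, ready to send back to the server."""
    return [{"reading": r, "prediction": predict(model, r)} for r in readings]


trained = {"low": 1.5, "high": 8.5}       # hypothetical trained model
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_model(trained, path)             # step 1: save after training
    loaded = load_model(path)             # steps 2 and 3: transfer, then load

results = run_inference(loaded, [2.0, 9.0])
print(json.dumps(results))                # payload the device would report back
```

In the real setup, each entry in `results` is the kind of message the Pi would relay to the Flask server for the dashboard to display.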
Technical guide on my GitHub repo here