Training a robotic arm for bulk grasping in a 3D environment
We train a robotic hand to grasp in 3D by combining TensorFlow with Unity3D through ROS. We simulate an industrial robot overlooking a bucket of objects in bulk. 2D images from the camera are provided via ROS as inputs to an AI. Our deep learning model then detects graspable objects in the bucket and sends back the corresponding 3D motor commands through ROS, allowing the robotic arm to perform the grasp in simulation. Both Unity3D and TensorFlow rely on heavy NVIDIA GPU computation to achieve real-time operation. Our AI implementation is more general than state-of-the-art grasping methods, which typically output 2D grasp positions on a table rather than 3D positions in bulk.
Training robots in a real environment can be a daunting task. First, data collection can take a huge amount of time and effort, because the robot's movements must be carefully controlled to avoid damage and are therefore kept cautiously slow. In addition, any change in the experimental setup requires reevaluating the safety measures, which makes one reluctant to try different configurations. Instead, we base our experimental setup on NVIDIA's physics engine PhysX in Unity3D, which makes it possible to recreate natural motions and realistic conditions for quickly testing various training environments. Notably, GPU computation makes it possible to operate in real time while we simulate the behavior of many colliding objects (the objects in the bucket, the bucket itself, and the robot) with visual effects such as lights, shadows, reflections and texture mapping.
Thus, we drop 3D objects of various sizes in an unordered and natural way into a bucket. For data collection, we place the robot’s grippers at strategic positions alongside each object and detect potential collisions with other objects while picking. If no collision occurs, the position is considered graspable. For each simulation, we save the camera view and the 3D grasp positions. This operation can be repeated as many times as required, easing the data collection tremendously in comparison with a real-world situation.
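The collision-based labeling described above can be illustrated with a minimal sketch. It is not the Unity3D/PhysX implementation: objects and gripper fingers are approximated here as spheres, and the names `Sphere`, `finger_positions` and `graspable_positions` are hypothetical, introduced only for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Sphere:
    """Simplified stand-in for a simulated object or a gripper finger (assumption)."""
    x: float
    y: float
    z: float
    r: float


def overlaps(a, b):
    """Two spheres collide when their centers are closer than the sum of their radii."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) < a.r + b.r


def finger_positions(obj, direction, finger_r):
    """Place two finger spheres on opposite sides of obj along a unit direction."""
    dx, dy, dz = direction
    offset = obj.r + finger_r
    return [
        Sphere(obj.x + s * offset * dx, obj.y + s * offset * dy,
               obj.z + s * offset * dz, finger_r)
        for s in (1, -1)
    ]


def graspable_positions(objects, directions, finger_r=0.01):
    """Return (object index, direction) pairs whose fingers touch no other object."""
    labels = []
    for i, obj in enumerate(objects):
        others = objects[:i] + objects[i + 1:]
        for d in directions:
            fingers = finger_positions(obj, d, finger_r)
            if not any(overlaps(f, o) for f in fingers for o in others):
                labels.append((i, d))
    return labels


# Two objects side by side along x: closing the gripper along x would hit the
# neighbour, closing it along y would not.
objects = [Sphere(0.0, 0.0, 0.0, 0.05), Sphere(0.1, 0.0, 0.0, 0.05)]
directions = [(1, 0, 0), (0, 1, 0)]
labels = graspable_positions(objects, directions)
```

In a full pipeline, each simulated scene would be stored as the camera view together with the grasp positions that pass this kind of check.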
Our AI model is a 3D adaptation of state-of-the-art deep learning architectures for object detection (e.g., Faster R-CNN [6], YOLO [5], Single Shot Detector [3]) in TensorFlow. High GPU parallelism with the CUDA and cuDNN libraries significantly speeds up training on the millions of data samples acquired in simulation. Using the camera view as input, the architecture outputs 3D grasp positions. During inference, inputs and outputs are exchanged through ROS in real time between the 3D simulation and the AI engine. The simulated robot then performs the 3D grasp according to the AI model's outputs.
State-of-the-art grasping methods [1; 2; 4] typically address a situation where various objects are spread out on a table or in a bucket. These AI models take the camera view as input and produce, for each object, a 2D location [1; 2] and a 1D orientation [1; 2; 4] for grasping. In contrast, our implementation provides a more general solution to this problem, notably by also controlling a robot in bulk conditions. Indeed, our AI infers from a 2D image not only the 3D location of the objects but also the best angle of approach for the grippers.
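To make the difference between the two output formats concrete, here is a minimal sketch of one possible parameterization. The source does not specify how the model encodes its outputs, so the classes `PlanarGrasp` and `Grasp3D`, the azimuth/tilt angles and the `roll` field are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanarGrasp:
    """2D location plus 1D orientation, as in table-top grasping."""
    x: float
    y: float
    theta: float  # gripper rotation about the vertical axis, in radians


@dataclass(frozen=True)
class Grasp3D:
    """3D location plus an approach direction (hypothetical encoding)."""
    position: tuple  # (x, y, z)
    approach: tuple  # unit vector along which the gripper approaches
    roll: float      # gripper rotation about the approach axis, in radians


def approach_from_angles(azimuth, tilt):
    """Unit approach vector; tilt = 0 means approaching straight down (-z)."""
    return (
        math.sin(tilt) * math.cos(azimuth),
        math.sin(tilt) * math.sin(azimuth),
        -math.cos(tilt),
    )


def lift_planar(grasp, table_z=0.0):
    """Embed a planar grasp as the special case of a vertical 3D approach."""
    return Grasp3D(
        position=(grasp.x, grasp.y, table_z),
        approach=approach_from_angles(0.0, 0.0),
        roll=grasp.theta,
    )


top_down = lift_planar(PlanarGrasp(0.2, 0.1, 0.5))
tilted = Grasp3D((0.2, 0.1, 0.05), approach_from_angles(1.0, 0.4), roll=0.5)
```

Under this encoding a planar grasp is simply a 3D grasp whose approach is fixed to the vertical, whereas a bulk grasp is free to tilt the approach direction to avoid neighbouring objects.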
[1] Asif, U., Tang, J., & Harrer, S. (2018). Densely Supervised Grasp Detector (DSGD). arXiv preprint. arXiv:1810.03962v1
[2] Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. (2016). Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv preprint. arXiv:1603.02199v4
[3] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A.C. (2015). SSD: Single Shot MultiBox Detector. arXiv preprint. arXiv:1512.02325v5
[4] Pinto, L., & Gupta, A. (2015). Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. arXiv preprint. arXiv:1509.06825v1
[5] Redmon, J., & Farhadi, A. (2018). YOLOv3: An Incremental Improvement. arXiv preprint. arXiv:1804.02767v1
[6] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv preprint. arXiv:1506.01497v3
Author: Romain Angénieux