June 22, 2016

Gearing Up for 2016 Telluride Neuromorphic Cognition Engineering Workshop

Guest Blog by Andrew Cassidy and Rodrigo Alvarez-Icaza

Gearing up. We are preparing for the 2016 Telluride Neuromorphic Cognition Engineering Workshop, held in the Colorado mountain town of Telluride. Beginning Sunday, June 26th, this annual workshop brings together nearly 100 researchers from all around the world to investigate brain-inspired solutions to topics such as:

  • Decoding Multi-Modal Effects on Auditory Cognition
  • Spike-Based Cognition in Active Neuromorphic Systems
  • Neuromorphic Path Planning for Robots in a Disaster Response Scenario
  • Neuromorphic Tactile Sensing
  • Computational Neuroscience

IBM's Brain-Inspired Computing Group is sending two researchers with an end-to-end hardware/software ecosystem for training neural networks to run, in real time, on the 4096-core TrueNorth neurosynaptic processor. The Eedn (Energy-efficient deep neuromorphic network) training algorithm enables near state-of-the-art accuracy on a wide range of visual, auditory, and other sensory datasets. When run on TrueNorth, these networks consume between 25 and 275 mW, achieving >6000 FPS/W performance.

We are bringing (Figures 1-3):

  • 16 NS1e boards (each with 1 TrueNorth neurosynaptic processor)
  • 1 server (with 4 Titan X GPUs) for training deep neuromorphic networks
  • and a bucket of cables.

Building on the successes from last year's workshop, and leveraging the training material from Boot Camp, our goal is to enable workshop participants to train, build, and run networks of their own. Combined with the real-time runtime infrastructure that connects input sensors and output actuators to the NS1e board, we have all of the tools in place to build low-power, end-to-end mobile and embedded systems that solve real-world cognitive problems.

Figure 1. Sixteen NS1e Boards
Figure 2. Training Server and Gear
Figure 3. Prep Station
Photo Credits: Rodrigo Alvarez-Icaza

June 09, 2016

PREPRINT: Structured Convolution Matrices for Energy-efficient Deep learning

Guest Blog by Rathinakumar Appuswamy

To seek feedback from fellow scientists, my colleagues and I are very excited to share a preprint with the community.

Title: Structured Convolution Matrices for Energy-efficient Deep learning

Authors: Rathinakumar Appuswamy, Tapan Nayak, John Arthur, Steven Esser, Paul Merolla, Jeffrey Mckinstry, Timothy Melano, Myron Flickner, Dharmendra S. Modha 

Extended Abstract: We derive a relationship between network representation in energy-efficient neuromorphic architectures and block Toeplitz convolutional matrices. Inspired by this connection, we develop deep convolutional networks using a family of structured convolutional matrices and achieve state-of-the-art trade-off between energy efficiency and classification accuracy for well-known image recognition tasks. We also put forward a novel method to train binary convolutional networks by utilising an existing connection between noisy-rectified linear units and binary activations. We report a novel approach to train deep convolutional networks with structured kernels. Specifically, all the convolution kernels are generated by the commutative pairs of elements from the Symmetric group S4. This particular structure is inspired by the TrueNorth architecture and we use it to achieve an improved accuracy vs energy tradeoff than we had previously reported. Our work builds on the growing body of literature devoted to developing convolutional networks for low-precision hardware toward energy-efficient deep learning.

Link: http://arxiv.org/abs/1606.02407
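
For readers new to the idea, the starting point of the abstract (a convolution written as multiplication by a Toeplitz matrix) can be illustrated in one dimension. This is a minimal sketch in plain Python, not the construction used in the preprint. The function names, kernel, and signal are illustrative assumptions, and "convolution" is used in the deep-learning sense of a sliding-window product without kernel flipping.

```python
def toeplitz_conv_matrix(kernel, n):
    """Build the (n - k + 1) x n matrix whose product with a length-n
    signal equals the 'valid' sliding-window product with `kernel`."""
    k = len(kernel)
    rows = n - k + 1
    matrix = [[0] * n for _ in range(rows)]
    for r in range(rows):
        for j, w in enumerate(kernel):
            # Each row is the previous row shifted right by one position,
            # so every diagonal of the matrix is constant (Toeplitz).
            matrix[r][r + j] = w
    return matrix


def matvec(matrix, x):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def sliding_window(kernel, x):
    """Direct sliding-window product, used here as the reference."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(len(x) - k + 1)]


# Illustrative values only.
kernel = [1, 0, -1]
signal = [3, 1, 4, 1, 5, 9]

T = toeplitz_conv_matrix(kernel, len(signal))
via_matrix = matvec(T, signal)
via_window = sliding_window(kernel, signal)
print(via_matrix == via_window)
```

For two-dimensional images the analogous matrix is built from Toeplitz blocks, which is where the "block Toeplitz" structure in the abstract comes from.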

June 08, 2016

PREPRINT: Deep neural networks are robust to weight binarization and other non-linear distortions

Guest Blog by Paul A. Merolla

To seek feedback from fellow scientists, my colleagues and I are very excited to share a preprint with the community.

Title: Deep neural networks are robust to weight binarization and other non-linear distortions

Authors: Paul A. Merolla, Rathinakumar Appuswamy, John V. Arthur, Steve K. Esser, Dharmendra S. Modha 

Abstract: Recent results show that deep neural networks achieve excellent performance even when, during training, weights are quantized and projected to a binary representation. Here, we show that this is just the tip of the iceberg: these same networks, during testing, also exhibit a remarkable robustness to distortions beyond quantization, including additive and multiplicative noise, and a class of non-linear projections where binarization is just a special case. To quantify this robustness, we show that one such network achieves 11% test error on CIFAR-10 even with 0.68 effective bits per weight. Furthermore, we find that a common training heuristic--namely, projecting quantized weights during backpropagation--can be altered (or even removed) and networks still achieve a base level of robustness during testing. Specifically, training with weight projections other than quantization also works, as does simply clipping the weights, both of which have never been reported before. We confirm our results for CIFAR-10 and ImageNet datasets. Finally, drawing from these ideas, we propose a stochastic projection rule that leads to a new state of the art network with 7.64% test error on CIFAR-10 using no data augmentation.

Link: http://arxiv.org/abs/1606.01981

June 06, 2016

May 23-26, 2016: Boot Camp Reunion

Last year, from August 3 to August 20, 2015, the IBM Brain-inspired Computing Team held a 3-week Boot Camp. Now, almost 9 months later, we held a Boot Camp Reunion from May 23 to 26, 2016 that brought together 64 attendees.

It was incredible to see the results from attendees, and gratifying to see them become productive on the next-gen Ecosystem and achieve state-of-the-art results within a matter of hours.

The following are three perspectives from my colleagues, Ben G. Shaw, Hartmut E. Penner, Jeffrey L. Mckinstry, and Timothy Melano. Don't miss the attendee comments at the bottom of this blog entry!

Developer Workshop
Photo Credit: William Risk


Guest Blog by Ben G. Shaw on Attendees.

We carefully selected and vetted all attendees.

The following institutions are returning attendees:

  • Air Force Research Lab, Rome, NY and Dayton, Ohio
  • Arizona State University
  • Army Research Lab
  • Georgia Institute of Technology
  • Lawrence Berkeley National Lab
  • Lawrence Livermore National Lab
  • National University of Singapore
  • Naval Research Lab
  • Pennsylvania State University
  • Riverside Research
  • Rensselaer Polytechnic Institute
  • SRC
  • Syracuse University
  • Technology Services Corporation
  • University of California, Davis
  • University of California, Los Angeles
  • University of California, San Diego
  • University of California, Santa Cruz
  • University of Dayton
  • University of Pittsburgh
  • University of Tennessee, Knoxville
  • University of Ulm
  • University of Western Ontario
  • University of Wisconsin-Madison

In addition, the following institutions are new attendees:

  • Department of Defense
  • Johns Hopkins University, Applied Physics Laboratory
  • Mathworks
  • MITRE Corporation
  • Oak Ridge National Laboratory
  • Pacific Northwest National Lab
  • RWTH Aachen & FZ Juelich / JARA
  • Sandia National Laboratories
  • Technical University of Munich
  • University of Florida
  • University of Notre Dame


Guest Blog by Hartmut E. Penner on Docker Infrastructure.

For the Boot Camp Reunion, we needed to create an environment where participants could use our latest release. From past experience, it was clear that even with the best possible installation instructions, we could not cope with the multitude of participants' systems in the time we had, and we wanted to spend precious Reunion time exploring the new Eedn programming environment. Besides, training required servers with specific high-end GPUs to finish in the short time available.

To support all this while minimizing installation time and effort, we decided to use the IBM SoftLayer Cloud to provide the systems with GPUs, and Docker to package the software. The SoftLayer Cloud provides so-called Bare Metal Servers with GPUs, which give the user full hardware access to the system, all the way to a hardware console over a web interface. The Bare Metal Servers we ordered came pre-installed with Ubuntu Linux 14.04.

Docker, as a container technology, makes it possible to package an application with all its dependencies, such as the runtime, system tools, system libraries, and the code itself. Unlike virtual machines, all container instances share the same kernel and therefore have much lower resource consumption, while still providing full isolation between them. Details are here.

For our software setup, we needed shared access to the GPUs in order to limit the amount of hardware. Fortunately, Nvidia provided a full dockerized environment that could be used as a nucleus for building a specialized Docker container with all our software and with access to the GPUs from multiple container instances. The container image was based on the CentOS 7 version of the cuda:7.5-devel image, extended with a VNC server, MATLAB, MatConvNet, and our TrueNorth software release. Each participant got an instance with exactly the same software, the only differences being user-specific persistent storage, SSH port, and public authorized SSH key. With that, each user was able to connect to their instance and upload data from any kind of Windows, MacOS, or Linux laptop. VNC traffic was tunneled over SSH, ensuring authentication and encryption of all traffic between the local user and the Cloud instance. For uploading a model and running test data on the NS1e, we chose HTTPS between the cloud instances and the local gateway server, ensuring that this traffic also could not be compromised.
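
To make the per-user setup concrete, here is a minimal sketch of how such instances and tunnels could be described. It only builds command lines and does not run them. The image name, host paths, port numbering, and host name are hypothetical placeholders, not the values used at the Reunion, and GPU passthrough is deliberately left out. The `docker run` options `-d`, `--name`, `-p`, and `-v` and the `ssh` options `-N`, `-p`, and `-L` are standard.

```python
import shlex

# HYPOTHETICAL values: the image name, host paths, and port numbering are
# placeholders for illustration only.
IMAGE = "truenorth-eedn:latest"
BASE_SSH_PORT = 2200


def docker_run_command(user, index):
    """Build the argv for starting one participant's container."""
    ssh_port = BASE_SSH_PORT + index
    return [
        "docker", "run", "-d",
        "--name", f"eedn-{user}",
        # User-specific SSH port on the host, mapped to sshd in the container.
        "-p", f"{ssh_port}:22",
        # User-specific persistent storage.
        "-v", f"/srv/storage/{user}:/home/{user}/data",
        # The participant's public key, mounted read-only.
        "-v", f"/srv/keys/{user}.pub:/home/{user}/.ssh/authorized_keys:ro",
        IMAGE,
    ]


def vnc_tunnel_command(user, host, index, vnc_port=5901):
    """Build the argv a participant could run to tunnel VNC over SSH."""
    ssh_port = BASE_SSH_PORT + index
    return [
        "ssh", "-N",                    # forwarding only, no remote command
        "-p", str(ssh_port),
        "-L", f"{vnc_port}:localhost:{vnc_port}",
        f"{user}@{host}",
    ]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    for i, name in enumerate(["alice", "bob"], start=1):
        print(as_shell(docker_run_command(name, i)))
    print(as_shell(vnc_tunnel_command("alice", "cloud.example.org", 1)))
```

With the tunnel open, a VNC viewer pointed at localhost:5901 reaches the desktop inside the container, and all of that traffic travels inside the SSH connection.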

The final installation was built on 7 Bare Metal Servers, each with two NVIDIA K80 GPUs, plus 16 NS1e boards. On those systems we had up to 80 simultaneous users accessing the instances over Wi-Fi, with parallel training and testing on the NS1e boards. The only problem with this installation was public/private key handling for the tunneling, which differs between Windows and Linux/MacOS laptops. After passing this hurdle, the installation ran very smoothly without any major issues, and provided an environment in which participants could concentrate on learning the new software rather than dealing with installation and incompatibilities.

It was beautiful to see 80 people simultaneously using the infrastructure!


Guest Blog by Jeffrey L. Mckinstry and Timothy Melano on Developer Workshop.

On the first day, May 23, the IBM team presented new results on convolution networks, the NS1e-16 system, and the NS16e system. This was followed by exciting new results from many of the participants -- there are now 16 publications from Boot Campers!

From May 24-26, we ran a hands-on Developer's Workshop in which the attendees were able, for the first time, to use our Energy-efficient Deep Neuromorphic (Eedn) networks on their own data and run these networks on TrueNorth. The reunion had a variety of sessions, ranging from theoretical deep dives into the mathematical abstractions of how we have mapped convolutional networks and backpropagation onto TrueNorth, to hands-on tutorial sessions. After a few hours of guided Eedn exercises, we took the training wheels off and let the students use new datasets to create state-of-the-art neural networks. By the end of the day, students had learned how to use the tools from our latest software release and were launching overnight training runs.

The next morning, May 25, students were excited to share very competitive scores on their datasets. The energy was great. Our guests from the Air Force Research Lab were our first success of many. On an aerial radar dataset, their scores were comparable to the performance they were getting on other, unconstrained networks; the difference in this case was that they were classifying images at 1000 frames per second while using a minuscule amount of power!

Later in the day our friend, Diego Cantor from the University of Western Ontario, was getting better results on an Eedn network than on his unconstrained Caffe networks! His data consists of ultrasound images of the human spinal column, and his story is actually quite entertaining, because his original image sizes were too big to be efficiently fed into a single TrueNorth chip. He was quite skeptical when we asked him to downsample his images to 32x32, but was later shocked to see that with smaller images and a sparse Eedn convnet he was able to beat his Caffe convnet. A hat tip to brain-inspired computing indeed.

There were lots of other successes, such as Tobi Brosch achieving 87 percent on a 14-category action recognition dataset and Garrick Orchard achieving 98.7 percent accuracy on a spiking MNIST dataset; but the real success was that all of the researchers in attendance were training Eedn networks and visualizing high-accuracy results very quickly.


Some Attendee Comments.

"I can make networks in a day now :)"

"I truly appreciate all of the effort from the IBM staff to put these events together; they are immeasurably beneficial for this community. Not only do we gain technical insight, but we have the opportunity to reconnect as a community. I thought the use of docker was inspired. It really simplified the process of baselining all participants without the need for complex, time consuming installations. I plan to emulate this process for my lab server. I loved having the bootcamp reunion at Almaden. What a truly beautiful and inspiring location to work and gather as a community. ... What I like most of all is how welcoming IBM is to all of us. I think I speak for this entire group when I say that you always make us feel like we are all part of something important."

"The docker setup that provided an existing installation was genius. That helped us focus our time and attention on learning Eedn instead of the installation."

"The reunion has been great - as was last year's BootCamp. An excellent hands on tutorial for training and deploying CNNs on TN. I feel that I could train my own networks, and this is huge for me because I do not come from a formal machine learning background. I applaud the IBM team for putting together another great workshop."

"The technology is impressive, from our perspective it's ideal for robotic applications. We are excited to get this back to our lab and put this on our robotics platform to see how well this operates in our environments. ... The reunion, the format, and the hosting has all been absolutely outstanding. It's nice to have a lot of IBM folks around that have been responsive to all of our questions and concerns. You all clearly spent a lot of time preparing tutorials, documentation, etc. and this preparation has helped and is well appreciated."

"The fact that participants were able to get up and running so quickly (and complained so little) reflects very positively on the tools and preparation."

"I've really enjoyed the reunion. I wasn't at the bootcamp so this is the first chance I've had to use the tools and found them very well constructed."

"I found this training to be very helpful in understanding how to implement neural networks on the TrueNorth hardware. Contained in only 4 days I think the training had a good balance of theory, applications and hands-on projects. The use of Docker containers to implement MatConvNet was very easy to access and deploy and the background information sent prior to the bootcamp made it easy to get up to speed quickly. What other groups have done with the chip in such a short time was quite impressive."

"Overall the presentations were laid out very well and were conducive to good information flow. The responsiveness of the team to help field any and all questions was well-received and invaluable; if someone did not know how to answer a question they actively sought someone on the team who did."

"For a 3.5-day crash course on TrueNorth (compared to a 3-week long boot camp), this is a success and probably the best one can organize for such a short-duration workshop. I would like to thank everyone at IBM for making it possible. I was able to work through the tutorials at my own pace, be immersed right away, and customize without starting from scratch."

"It has been interesting and inspiring to see the breadth and extent of work that has been done by BootCamp participants, and to see how the tools have evolved since last summer."

"Excited about the new convolutional network capabilities."

"I want to start by saying thank you to all that put forth so much time and effort into this workshop. I wish I could have been here last year at the first bootcamp, but you all have made it very easy to get up to speed. I like the structure and format at this workshop, it was well conceived and executed."

April 08, 2016

Mighty oaks from little acorns grow.

A 65,536-fold growth in the number of neurons in less than 6 years: from a chip with 256 neurons in August 2011, to 1 million neurons in TrueNorth in August 2014, to 16 million neurons in the LLNL NS16e System in March 2016.
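
The arithmetic behind that factor can be checked in a few lines. This is a small sketch that assumes "1 million" and "16 million" are the rounded binary values 2^20 and 16 x 2^20, which is what 4096 cores of 256 neurons each give. The variable names are illustrative.

```python
import math

neurons_2011 = 256          # chip from August 2011
cores_per_chip = 4096       # TrueNorth core count
neurons_per_core = 256
chips_in_ns16e = 16

neurons_truenorth = cores_per_chip * neurons_per_core    # "1 million"
neurons_ns16e = chips_in_ns16e * neurons_truenorth       # "16 million"

growth = neurons_ns16e // neurons_2011
doublings = math.log2(growth)
months = (2016 - 2011) * 12 + (3 - 8)   # August 2011 to March 2016

print(f"{neurons_truenorth:,} neurons per TrueNorth chip")
print(f"{neurons_ns16e:,} neurons in NS16e")
print(f"{growth:,}x growth, {doublings:.0f} doublings in {months} months")
```

Under this reading, the growth corresponds to 16 doublings in 55 months.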

March 28, 2016

PREPRINT: Convolution Networks for Fast, Energy-Efficient Neuromorphic Computing

Guest Blog by Steven K. Esser

Today, to seek feedback from fellow scientists, my colleagues and I are very excited to share a preprint with the community.

Title: Convolution Networks for Fast, Energy-Efficient Neuromorphic Computing

Authors: Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha

Abstract: Deep networks are now able to achieve human-level performance on a broad spectrum of recognition tasks. Independently, neuromorphic computing has now demonstrated unprecedented energy-efficiency through a new chip architecture based on spiking neurons, low precision synapses, and a scalable communication network. Here, we demonstrate that neuromorphic computing, despite its novel architectural primitives, can implement deep convolution networks that i) approach state-of-the-art classification accuracy across 8 standard datasets, encompassing vision and speech, ii) perform inference while preserving the hardware's underlying energy-efficiency and high throughput, running on the aforementioned datasets at between 1100 and 2300 frames per second and using between 25 and 325 mW (effectively > 5000 frames / sec / W) and iii) can be specified and trained using backpropagation with same ease-of-use as contemporary deep learning. For the first time, the algorithmic power of deep learning can be merged with the efficiency of neuromorphic processors, bringing the promise of embedded, intelligent, brain-inspired computing one step closer.

Link: http://arxiv.org/abs/1603.08270

At this juncture, I hope that a personal retrospective will help you share my enthusiasm.

In the winter of 2008, I graduated from school in Wisconsin and moved to California over the New Year to begin working in what was to become IBM’s Brain Inspired Computing lab. The team had just won the DARPA SyNAPSE contract. Arriving, I joined a handful of researchers whose enthusiasm quickly pulled me into sharing their lofty vision -- to build a computer designed from the ground up along the lines of the mammalian brain, and to have that system make a beneficial impact on society. Our goal was clear; our route was not. From the beginning, we chose to take a different approach from rule-based artificial intelligence, whose algorithms, though able to do some of the things our brain can do, work nothing like the brain under the hood. We also chose to take a different path from traditional artificial neural networks, which use neurons and synapses -- the basic computational elements of the brain -- but ignore limits on precision and connectivity that are critical for low power operation. Energy-efficiency is, after all, critical to creating scalable or embedded systems (and indeed our own brain uses only as much power as a typical lightbulb). At the start, this decision left us with no suitable existing approaches capable of tackling real world problems, as many of our scientific peers were more than willing to remind us! This gave me pause, but the challenge was fascinating, and so with the optimism of a recent graduate, I dove in.

As work progressed, the lab had the fortune to build an incredible team of hardware researchers. They created a prototype brain-inspired core in 2011 and the four thousand core TrueNorth chip in 2014, giving us the hardware needed for fast, low energy “neurosynaptic” computation. On the algorithm front, we began to internalize the architecture and created a number of small scale demos in 2012 and 2013 and also built a programming language that provided the necessary foundation for later work. These showcased certain capabilities, but were custom solutions to specific problems. To have a major impact, we needed a truly general purpose, scalable approach for network creation.

To this end, our path led back to neural networks in the form of modern deep learning, which can achieve state-of-the-art accuracy across a broad range of perceptual tasks. Though the traditional methods of deep learning are not directly compatible with our chip, to our great surprise and relief, we found that they are extremely resilient to reduced precision and connectivity, as necessitated by TrueNorth. From this insight, we were able to adapt the canonical backpropagation learning rule used in deep learning for compatibility with TrueNorth. At NIPS 2015 last year, we demonstrated near state-of-the-art accuracy on a canonical handwritten digit recognition challenge, using orders of magnitude less energy than the best previous results. As this work was ongoing, we were very encouraged to see research from a number of other laboratories demonstrating the power of deep learning for low precision computation.

Today, this preprint builds upon many previous efforts. In this work, we demonstrate a method for adapting convolutional neural networks, a powerful tool for deep learning, to create networks that run on the TrueNorth chip. We achieve near state-of-the-art accuracy on 8 datasets spanning color image and speech, while running on real hardware at between 1100 and 2300 frames per second and using between 25 and 325 mW, which is effectively > 5000 frames / sec / W. This is important because approaching state-of-the-art accuracy within neuromorphic constraints was previously believed to be difficult, if not impossible, and because the ensuing speed and energy efficiencies open up an entirely new operating regime not accessible via conventional computing.
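
For readers who want to relate frame rate, power, and the frames / sec / W figure, the conversion is simple division. The sketch below uses hypothetical operating points. The quoted ranges are not paired per network here, so the example numbers are assumptions chosen inside those ranges rather than measurements from the preprint.

```python
def fps_per_watt(frames_per_second, milliwatts):
    """Frames per second per watt, which is the same as frames per joule."""
    return frames_per_second * 1000.0 / milliwatts


def min_fps_for(target_fps_per_watt, milliwatts):
    """Smallest frame rate that reaches the target efficiency at this power."""
    return target_fps_per_watt * milliwatts / 1000.0


# HYPOTHETICAL operating point inside the quoted ranges.
example = fps_per_watt(1200, 200)

# Frame rate needed to reach 5000 frames / sec / W at each end of the
# quoted power range.
needed_at_25_mw = min_fps_for(5000, 25)
needed_at_325_mw = min_fps_for(5000, 325)

print(example, needed_at_25_mw, needed_at_325_mw)
```

The second function shows why efficiency has to be read per network: the more power a network draws, the higher its frame rate must be to stay above a given frames / sec / W figure.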

This work is a major personal milestone in a journey that began over 7 years ago and I cannot wait to see the breakthroughs that the next 7 years will bring!

A Scale-up Synaptic Supercomputer (NS16e): Four Perspectives

Today, Lawrence Livermore National Lab (LLNL) and IBM announce the development of a new Scale-up Synaptic Supercomputer (NS16e) that highly integrates 16 TrueNorth Chips in a 4×4 array to deliver 16 million neurons and 4 billion synapses. LLNL will also receive an end-to-end software ecosystem that consists of a simulator; a programming language; an integrated programming environment; a library of algorithms as well as applications; firmware; tools for composing neural networks for deep learning; a teaching curriculum; and cloud enablement. Also, don't miss the story in The Wall Street Journal (sign-in required) and the perspective and a video by LLNL's Brian Van Essen.

To provide insights into what it took to achieve this significant milestone in the history of our project, following are four intertwined perspectives from my colleagues:

  • Filipp Akopyan -- First Steps to an Efficient Scalable NeuroSynaptic Supercomputer.
  • Bill Risk and Ben Shaw -- Creating an Iconic Enclosure for the NS16e.
  • Jun Sawada -- NS16e System as a Neural Network Development Workstation.
  • Brian Taba -- How to Program a Synaptic Supercomputer.

The following timeline provides context for today's milestone in terms of the continued evolution of our project.

Timeline
Illustration Credit: William Risk


First Steps to an Efficient Scalable NeuroSynaptic Supercomputer

Guest Blog by Filipp Akopyan

 

Recently, IBM's Brain-inspired Computing Team revealed the world's first 1 million-neuron evaluation platform for mobile applications (NS1e), based on IBM's TrueNorth (TN) neurosynaptic chip. Afterwards, we demonstrated the first 16 million-neuron scale-out system (NS1e-16), assembled using 16 instances of the NS1e board along with supporting periphery, which includes a host server, network router, power supervisors, and other components. Detailed information on NS1e-16 may be found at [Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)].

NS1e-16 is a powerful system, but we needed a bigger challenge. Why not build a more compact, more efficient 16 million neuron system that can fit in your shoe box?! We have had some initial prototypes of such a system before, but nothing that could actually be delivered to our customers and partners. So, with the support of LLNL, we embarked on a journey to build the NeuroSynaptic 16 million-neuron evaluation platform, NS16e (note the subtle but significant lettering difference from the earlier NS1e-16 system).

The NS16e evaluation platform consists of the following main components: a custom 4×4 board, a custom Interposer board, and an Avnet off-the-shelf Zynq SOC mini module, AES-MMP-7Z045-G. The three boards are assembled into a single NS16e structure using vertical stacking/mating connectors to supply power to all the boards and to exchange data and control signals.

The 4×4 board contains 16 programmable TrueNorth chips capable of implementing large-scale (up to 16 million neurons) neural models for various applications. Each TrueNorth neurosynaptic chip has 4092 usable cores, with each core containing 256 neurons, 256 axons, and 65536 synapses. To maximize the communication bandwidth with the TrueNorth chips on the 4×4 board, we use data port expanders, which access the TrueNorth input/output ports. A similar set of ICs is used to perform the configuration of all the TrueNorth chips using scan chains. The 4×4 board also hosts all the TrueNorth power domain regulators, power supervisor, current sensing circuitry, and several SPI/I2C programmable devices.
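
The per-core figures above multiply out to the board-level capacity. Here is a minimal sketch of that arithmetic, using only the numbers quoted in this paragraph. The class and function names are illustrative and are not part of any TrueNorth tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSpec:
    """Per-core resources as quoted in the text."""
    neurons: int = 256
    axons: int = 256
    synapses: int = 65536


def board_capacity(chips=16, usable_cores_per_chip=4092, core=CoreSpec()):
    """Total usable resources of a board with `chips` TrueNorth chips."""
    cores = chips * usable_cores_per_chip
    return {
        "cores": cores,
        "neurons": cores * core.neurons,
        "synapses": cores * core.synapses,
    }


core = CoreSpec()
# 65536 synapses is exactly one synapse per (axon, neuron) pair.
assert core.axons * core.neurons == core.synapses

capacity = board_capacity()
for name, value in capacity.items():
    print(f"{name}: {value:,}")
```

With 4092 usable cores per chip this comes to roughly 16.8 million neurons and about 4.3 billion synapses on the 4×4 board.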

The Interposer board provides high speed interfaces and power domains for all the NS16e system components. It contains a PCIe x4 connector for high-speed communication with a host machine and an SFP+ Ethernet cage (which is not currently enabled). The Interposer also contains a power supervisor and a series of electronic fuses (along with the 4×4 board) meant to shut down the system in case of a power failure. For debugging purposes the Interposer board (as well as the 4×4 board) contains LEDs, which may indicate failures on power domains. The Interposer also hosts a JTAG connector for programming the Zynq module.

The Zynq module provides an FPGA fabric for implementing custom glue logic to control and potentially modify (if there is such a requirement) the data entering and leaving NS16e. Specifically, the Zynq fabric may be used for DMA, data-to-spike transduction, filtering, etc. The high speed transceivers on the SOC (system-on-chip) are used to provide a PCIe communication bridge between NS16e and the host system. The Zynq SOC also contains two ARM processor cores, which are not enabled on the current revision of NS16e to minimize power consumption.

The team created a specification and schematics, and performed the physical design (layout) of the system. We then manufactured the boards and assembled them using the latest cutting-edge electronic components.

So without further ado, we present to you the first NS16e prototype system. Figure 1 shows the assembled 4×4 board (no chips); Figure 2 shows the assembled Interposer board.

Figure 1. NS16e: 4×4 board TOP on the left, and BOTTOM on the right.

 

Figure 2. NS16e: Interposer board TOP on the left, and BOTTOM on the right.

 

Since the current number of TrueNorth chips is limited, we have decided to use an assembly risk aversion technique and populate the initial few systems without soldering down the TrueNorth chips. This gives us an opportunity to test the complete design of the system up to (but not including) the actual TrueNorth chips. Figure 3 shows the two boards side-by-side, ready for stacking on top of one another (Zynq module not depicted).

Figure 3. NS16e: 4×4 and Interposer boards side-by-side (ready for mating).

 

As one can imagine, the bring-up of such an unconventional system is not a trivial task. We have developed and put in place various hardware, firmware and software tests to verify the functionality of the system.

On these first prototypes, we provisioned for a special type of socket that can be populated post-assembly to host the actual TrueNorth chips. Once we had fully verified the functionality of the two newly designed boards without the chips, we populated the 4×4 board with TrueNorth chips by means of the aforementioned sockets, as shown in Figure 4.

Figure 4. NS16e: System populated with TN-chips using sockets.

 

Power-up of this platform has to be performed in stages, with extra care to make sure that the current draw of the system is within reason (i.e., no shorts); all the key interface signals between the TrueNorth chips and the glue logic are monitored with oscilloscopes for proper operation and signal integrity. The low-level bring-up of the full system is depicted in Figure 5.

Figure 5. NS16e: Low level bring-up with TN-chips in sockets.

 

Bringing up and debugging complex boards takes at least several days (with luck on your side). This platform was no exception. Once we had all the low-level tests running correctly on the system with sockets, we gave the green light to our board assembler to create the first prototypes with soldered-down TrueNorth chips. The first 4×4 board with soldered chips is shown in Figure 6.

Figure 6. First 4×4 prototype with soldered down TN-chips.

 

Now that all the low-level tests have been successfully completed, we connect the new NS16e systems to the host servers (Figure 7) using several interfaces: Xilinx JTAG for Zynq programming, Lattice JTAG for power supervisor and FPGA programming, and a PCIe gen2 link for data exchange between the host server and the TrueNorth chips.

Figure 7. NS16e systems connected to host machines.

 

At this point the high level testing takes over and we start running neural network algorithms and applications right on the new NS16e system.

Some of us (myself included) like to work on the weekends, since it is quiet and the janitors don't bother you when you stay late! Also there are no mandatory meetings, so it is just you and the pure drive to create the world's most advanced neurosynaptic system! The only problem is that the cafeteria is closed ... so we have to get pizza!

Figure 8. Scott Lekuch having a bite to eat and performing NS16e application testing simultaneously.

 

At the end of this exciting journey we have successfully brought up the NS16e prototypes. Now with the help of our highly-talented software team we can run complex neural applications on this powerful platform. Check out the blogs by Jun Sawada and Brian Taba for the application details. Advanced image/sound recognition and classification in real time on 16 million artificial spiking neurons, anyone?!

Line Separator

 

Creating an Iconic Enclosure for the NS16e

Guest Blog by Bill Risk and Ben Shaw

 

For deployment to LLNL, the three-board stack that Filipp Akopyan described above needed to be housed in a protective enclosure. Since the NS16e system is a first-of-its-kind neurosynaptic supercomputer, we wanted to create an interesting, iconic enclosure design that would reflect this novelty, while securing the boards and allowing access to the required connectors, switches, and indicators. Since the board configuration was already fixed, the enclosure had to be designed to accommodate the existing placement of these components and the irregular shape of the three-board stack.

These goals created a challenging design problem, but fortunately, we have been collaborating for several years with the highly creative IBM Design Team, who had previously designed concept models showing potential applications of the TrueNorth chip [Cognitive Apps]. Aaron Cox (Industrial Design) had worked with us earlier to create an iconic gold cover for the chip (Figure 9). The design of this cover, with tabs in the four cardinal directions, expresses a key feature of the TrueNorth chip: its built-in ability to be tiled with other chips in a two-dimensional array. Since the NS16e is the first system we've built that fully exploits this tileability, we wanted to ensure that the enclosure made the chip caps visible, to dramatically highlight the 4×4 array of chips contained in the system (Figure 10). Working on the overall personality of this research prototype system, Camillo Sassano (Industrial Design) and Kevin Schultz (User Experience) created a design that uses sharp angled surfaces to give the enclosure a shape that appears to change with the viewing angle. A 4×4 array of chamfered pockets on the front features and emphasizes the 3D effect of the golden chip caps, and light accents complement the sculptural geometry of the device.

TrueNorth chip cap

Figure 9. TrueNorth chip cap. The 4¹⁴ refers to the number of synapses on a single chip (4¹⁴ = 256 × 2²⁰, i.e. 256 million in binary units).
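
The figure on the cap can be checked with a few lines of arithmetic. The sketch below assumes the commonly cited TrueNorth organization of 4096 cores per chip, each with a 256 × 256 synaptic crossbar (256 input axons by 256 neurons); the constant names are ours, chosen for illustration.

```python
# Assumed TrueNorth organization: 4096 cores, each with 256 axons x 256 neurons.
CORES_PER_CHIP = 4096
AXONS_PER_CORE = 256
NEURONS_PER_CORE = 256

synapses_per_chip = CORES_PER_CHIP * AXONS_PER_CORE * NEURONS_PER_CORE
neurons_per_chip = CORES_PER_CHIP * NEURONS_PER_CORE

print(synapses_per_chip == 4 ** 14)  # True
print(synapses_per_chip // 2 ** 20)  # 256 (binary "millions" of synapses)
print(neurons_per_chip // 2 ** 20)   # 1 (binary "million" neurons)
```

Since 4096 = 4⁶ and 256 = 4⁴, the product is 4⁶ × 4⁴ × 4⁴ = 4¹⁴.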


 

NS16e enclosure

Figure 10. NS16e enclosure.

 

The need to adapt the enclosure to the existing board stack (for example, to actuate pushbutton switches on the edge of one of the PC boards) required the design of some clever mechanisms. A 3D-printed slider provided the ability to press one switch (Figure 11, left panel). For the other pushbutton, a rocker mechanism was used; however, since we also had to provide visual access to an adjacent LED, a light pipe was incorporated into the rocker mechanism, which was 3D-printed in a transparent material to serve both functions (Figure 11, right panel).

Slide mechanism Rocker mechanism

Figure 11. Slide mechanism (left) and rocker mechanism (right).

 

We wanted the enclosure to support two modes of use: one sitting on a desktop as a standalone unit; the other, placement within a 2U-high drawer for mounting in a server rack. The enclosure was designed so that it could be used in either mode, by attaching either a support foot to allow it to stand on its own (Figure 12) or an adapter that allows it to be mounted in the drawer. In the rackmount mode, the NS16e can be laid flat in the closed drawer for normal operation; the drawer can be opened and the NS16e tilted up for display or maintenance (Figure 13). Since the enclosure is not visible when the drawer is closed, we added some panel graphics and LED lighting to identify and distinguish the unit.

Desktop mode

Figure 12. Desktop mode.

 

Rackmount mode

Figure 13. Rackmount mode.

 

Once mounted in the enclosure, the system is ready for final testing and delivery to LLNL!

Line Separator

 

NS16e System as a Neural Network Development Workstation

Guest Blog by Jun Sawada

 

It all started very small. I still remember when we first simulated a single TrueNorth neuron model; it ran only on a circuit simulator. A few years later, we fabricated it into the TrueNorth chip, with 1 million neurons on a stamp-sized piece of silicon. Today, we have built the NS16e system.

Unlike the scale-out system we built earlier using single-chip boards [Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)], this machine is intended for running larger, more powerful neural networks in real time. In a typical setup, the NS16e system is connected to a host x86 server by a PCI Express link, over which the host can pump large volumes of data in and out of the NS16e. A host server with GPUs can build and train a large neural network, and the NS16e can immediately start running it. With a single command, we can run the entire end-to-end development process: preprocessing the training data, constructing the neural network, training it, optimizing it for hardware, and running it on the NS16e. Because this development cycle is so quick, an NS16e system combined with a powerful host server is a dream machine for large-scale neural network development.

To see how the machine works, please look at the system overview in Figure 14. The core neural computation in the NS16e takes place in the TrueNorth chips, tiled in a 4×4 array. The TrueNorth chip is designed in such a way that, if you connect chip to chip by wires, the chips start talking to each other using spikes. All the communication takes place using an asynchronous protocol, without any external clocks or additional interfacing chips.

NS16e system overview

Figure 14. NS16e system overview.

 

A host server is connected to the NS16e by a PCI Express link, which can carry 500Mb/sec of data between the NS16e and the host server. The FPGAs on the NS16e system work like a bridge between the TrueNorth chips and the host server. In a sense, they are a translator between a conventional von Neumann computer and a TrueNorth-based neurosynaptic computer, which speak different languages. A digital von Neumann computer operates on instructions and binary data, while the neurosynaptic computer talks in neuron spike signals.

When you enter a command on the host computer to run a neural network model, here is what happens. The host computer sends the network model over the PCI Express link and loads it onto the TrueNorth chips on the NS16e. The terminal window on the computer shows the progress of the upload. After 30 seconds or so (depending on the size of the network model), the upload is complete and the neural network starts spiking actively. On the host computer screen, you can see only a summary of how many output spikes are generated and how fast all the neurons are updated. Neuron spikes themselves are encoded in a way that is not easy for a person to understand. That is where we use visualizers: they decode the spikes coming out of the NS16e and visualize what the NS16e is trying to say.

Let's say we are going to run an image recognition task on the NS16e. We upload the neural network to the NS16e, and then we send it neuron spikes encoding photo (A) in Figure 15. The NS16e produces its answers as neuron spikes and sends them back to the visualizer program. The visualizer decodes the spikes and shows the picture the NS16e chose, like (B) in Figure 15.

Image recognition by NS16e

Figure 15. Image recognition by NS16e.

 

The system is still evolving. We continuously create and test new learning algorithms, new model generation techniques, and new optimization algorithms. One unique thing about this machine is that it needs some optimization of how we map the logical representation of a neural network to physical circuits. In our brain, a certain portion of the cortex is responsible for visual recognition and other parts are responsible for motor function. The NS16e has something similar: you may assign which chip holds which part of a large neural network. For example, we can make each chip hold a single layer of a multi-layer neural network, or we may assign each chip a slice of the entire multi-layer network corresponding to a patch of an image. This optimization problem, which we call "core placement", is unique to NS16e-like neurosynaptic machines. A good placement makes some networks run faster and more efficiently, because inter-chip communication is much slower than intra-chip communication. This is one place where we are trying different algorithms and techniques one after another.

The NS16e system is truly a product of great teamwork. Building on the TrueNorth chip design we worked on in the past, the NS16e required new circuit boards, FPGA programmable logic, system software, placement optimizers, neural network development tools, and training algorithms. Only when all of these components work together does the whole system begin to produce meaningful answers. It is like creating an entire ecosystem from scratch and integrating all of its parts. We also put a lot of effort into perfecting the system. We ran countless tests on the NS16e and compared the results against our simulation results, to make sure that the finished NS16e system has no malfunctioning components. We tracked down every hardware and software issue very earnestly. The system testing and documentation team cleaned up many imperfections in our software and documentation. All of this makes me very proud of the NS16e system, which is the result of diligent work by many people over many years.

At the end of this story, I should say that the NS16e system is just one step toward much larger-scale systems. Today's convolutional deep learning networks are growing rapidly in size, so we would like to build much bigger machines than the NS16e. Future scaled-up versions of the NS16e will certainly run much larger neural networks using a fraction of the energy consumed by CPUs and GPUs. Perhaps one day we may see a single rack of neurosynaptic systems with as many neurons and synapses as a human brain.

Image Recognition by NS16e

 

How to Program a Synaptic Supercomputer

Guest Blog by Brian Taba

 

So how do you actually write programs for a 16-million-neuron synaptic supercomputer? Over the last year, we have built an integrated software ecosystem around a stack of end-to-end TrueNorth development tools. We have bundled this ecosystem into a TrueNorth DevKit for release to partners like LLNL and the many other alumni of last year's Boot Camp.

Let's examine one of the reference applications included in our DevKit, a simple TrueNorth image classifier. This example uses a GPU to train a convolutional neural network (CNN) on a standard benchmark dataset, and then deploys the trained network on the NS16e. For more about our algorithm for training CNNs for TrueNorth, see the paper just posted by Steve Esser. Here, we will focus on the developer workflow.

Image classifier run flow

Figure 16. Run flow for a TrueNorth image classifier.

 

The basic structure of a generic TrueNorth image classifier is laid out in Figure 16. We first acquire a stream of input data, which could be image files read from disk, frames from a webcam, or even natively generated spikes from a Dynamic Vision Sensor (DVS). Raw data might be preprocessed into features like multi-scale edge maps, before being encoded as spike streams for input to TrueNorth. Depending on how the application is distributed across system components, preprocessing and encoding might happen in a server CPU, in the board's embedded ARM cores or FPGA, or intrinsically in a spiking sensor. Input spikes can be processed by TrueNorth chips on an NS1e or NS16e board, or by the NSCS functional simulator. Finally, output spikes are decoded into classification predictions that are typically sent off-board for visualization.
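
To make the "encoded as spike streams" step concrete, here is a minimal sketch of one simple scheme, a deterministic rate code in which brighter pixels spike more often. This is an illustration only: the encoding actually used by the DevKit tools is not described here, and the function name rate_encode is ours.

```python
from typing import List


def rate_encode(pixels: List[float], ticks: int) -> List[List[int]]:
    """Encode intensities in [0, 1] as 0/1 spikes over `ticks` time steps.

    A pixel with intensity p emits a spike each time the running total
    p * (t + 1) passes a whole number, so it fires about p * ticks times.
    """
    frames = []
    for t in range(ticks):
        frames.append([int(int(p * (t + 1)) > int(p * t)) for p in pixels])
    return frames


# Four pixels from dark to bright, encoded over 4 ticks.
spikes = rate_encode([0.0, 0.25, 0.5, 1.0], ticks=4)
spike_counts = [sum(column) for column in zip(*spikes)]
print(spike_counts)  # [0, 1, 2, 4]
```

Each row of the result is one tick of input, which is the form in which a spike stream would be fed to hardware or a simulator.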

In this instance, we train a TrueNorth network to classify images from the CIFAR-10 and CIFAR-100 datasets. We simplify the problem by preprocessing and encoding the entire dataset offline. The resulting input spike file is looped through TrueNorth hardware to generate class predictions for the images in the test set, which are streamed to a workstation for visualization.

Image classifier development flow

Figure 17. Development flow for a TrueNorth image classifier.

 

The chain of actions required to train and deploy a TrueNorth image classifier is shown in Figure 17. Out of the box, the DevKit provides at least one tool to fill in every block in this chain, but you can swap out any of our tools for your own, as long as you implement the same interface. For example, the DevKit includes two equivalent tools for reading and writing LMDB databases—a command-line utility called tn-signal-processor, of which more later; or a set of MATLAB MEX files—but as long as the downstream blocks get an LMDB in the right format, they don't care whether you used one of our tools to make it, or rolled your own.

Dataset/Preprocessor

The first steps in the training workflow are just basic data science. We import the original dataset into a standard format that all of our tools understand, then preprocess it into the features we will use at runtime. We store data in LMDB databases, which are popular for deep learning because their fast read access lets training data pump quickly into a GPU.

To import and preprocess data, we provide a fast C++ command-line utility called tn-signal-processor. Written by David Berg, tn-signal-processor is a Swiss Army knife for manipulating data—it can import JPG and PNG files into our LMDB format, crop, rotate, apply center-surround filters, encode images as spikes, decode spikes into images, visualize data, and more. It can also convert Caffe's LMDB format to ours and back, so anyone who has already imported their data into Caffe can easily transfer it to the TrueNorth ecosystem, and vice versa.

Trainer/Corelet

Now that we have our data in the proper format, it's time to train a classifier. TrueNorth Convolutional Networks (TNCN) is a framework for composing and training convolutional neural networks that automatically satisfy the TrueNorth design constraints. Created by Steve Esser using the Corelet Programming Environment (CPE), TNCN frees a data scientist to focus on abstract network properties like layer order, filter size, and data precision, without getting bogged down in the nuts and bolts of how to configure the 23 free parameters in a TrueNorth neuron to represent data as 1-bit spikes instead of 32-bit floats, for example.

TNCN implements high-level classes for common CNN layers like convolution, pooling, dropout, loss, etc. Each layer class has a corresponding corelet that automatically compiles its parameters into legal TrueNorth core configurations. The result is a model file containing a list of logical core parameters that collectively implement the trained TNCN network.

To accelerate training with a GPU, we use a deep learning framework called MatConvNet, which wraps NVIDIA's cuDNN primitives in binary MEX files that are easy to invoke in MATLAB. MatConvNet is not one of the famous frameworks like Caffe or Torch, but we picked it because we already needed MATLAB for CPE and it's very convenient to have a single development environment for codesigning TNCN layers and corelets. Of course, when we want to run many training instances in parallel on a GPU cluster, it's awkward and expensive to activate a MATLAB license on every compute node, so we also provide a compiled version of TNCN that can be run at the command line using the free MATLAB Runtime.

Placer

The final step before deploying this network to hardware is to assign the logical cores in the model file to physical cores in the NS16e chip array. This can be a challenge in multi-chip systems, due to the bandwidth bottleneck at the interface between chips. If cores are sending too many spikes across chip boundaries, spikes might not all arrive at their destinations within the design window of 1 millisecond per simulation tick. Of course, we can always increase the tick period to close timing. But it's better to reduce cross-chip traffic if possible, by being smarter about where we place cores in the chips.
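
The trade-off between cross-chip traffic and tick period can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a deliberately simple model in which each chip boundary can carry a fixed number of spikes per millisecond; the function name, the traffic figures, and the link capacity are all made up for illustration and are not measured NS16e values.

```python
import math


def min_tick_ms(spikes_per_tick, link_capacity_per_ms):
    """Smallest whole-millisecond tick in which the busiest link can deliver its spikes.

    spikes_per_tick: dict mapping a chip boundary to the spikes crossing it each tick.
    link_capacity_per_ms: assumed number of spikes one boundary can carry per millisecond.
    """
    busiest = max(spikes_per_tick.values(), default=0)
    return max(1, math.ceil(busiest / link_capacity_per_ms))


# Made-up traffic figures for two chip boundaries, and a made-up link capacity.
traffic = {("chip0", "chip1"): 1200, ("chip1", "chip2"): 400}
print(min_tick_ms(traffic, link_capacity_per_ms=1000))  # 2
```

Under this model, moving cores so that the busiest boundary drops below the assumed capacity brings the tick back to 1 millisecond, which is the motivation for smarter placement.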

For the LLNL system, Pallab Datta wrote a new tool from scratch that heuristically places logical cores in the TNCN model file at physical locations in the NS16e core array (Figure 18), based on minimizing chip crossings in the core-to-core connectivity graph. The Neuro Synaptic Core Placer (NSCP) is critical for making large TrueNorth models like a multi-chip TNCN network run successfully in hardware.
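
As a toy illustration of placing cores to minimize chip crossings, the sketch below uses a simple greedy heuristic: each core goes to the chip that already holds most of its neighbours. This is not NSCP's algorithm, which is not detailed here; the function names and the eight-core example network are hypothetical.

```python
from collections import defaultdict


def count_crossings(edges, chip_of):
    """Count core-to-core connections whose two endpoints sit on different chips."""
    return sum(1 for a, b in edges if chip_of[a] != chip_of[b])


def greedy_place(num_cores, edges, num_chips, cores_per_chip):
    """Place cores in index order on the chip that already holds most of their neighbours."""
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    chip_of, load = {}, [0] * num_chips
    for core in range(num_cores):
        votes = [0] * num_chips
        for n in neighbours[core]:
            if n in chip_of:
                votes[chip_of[n]] += 1
        open_chips = [c for c in range(num_chips) if load[c] < cores_per_chip]
        # Most placed neighbours wins; ties go to the least-loaded chip.
        best = max(open_chips, key=lambda c: (votes[c], -load[c]))
        chip_of[core] = best
        load[best] += 1
    return chip_of


# Toy network: two independent chains of cores (even-numbered and odd-numbered).
edges = [(0, 2), (2, 4), (4, 6), (1, 3), (3, 5), (5, 7)]
naive = {core: core // 4 for core in range(8)}  # fill chip 0 first, then chip 1
placed = greedy_place(8, edges, num_chips=2, cores_per_chip=4)
print(count_crossings(edges, naive), count_crossings(edges, placed))  # 2 0
```

Filling chips in index order splits both chains across the boundary, while the greedy pass keeps each chain on its own chip. A real placer would also have to weigh spike rates and the 2D chip layout.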

Physical core placement on NS16e

Figure 18. Physical core placement for a 4-chip CIFAR-10 network on NS16e.

 

Application

Now we can run our trained TrueNorth classifier in hardware, and see it in action! To decode, evaluate, and visualize the test image labels predicted by hardware, we have another command-line utility called PatchClassifier that can be configured to lay out an array of category icons for a given image dataset, with the predicted category outlined in green if correct and red if incorrect. The length of the bar beneath each category indicates the confidence of the classification.

Figure 19 is a screencap of the predicted CIFAR-100 test image classifications being streamed from the NS16e to the PatchClassifier visualizer running on a laptop. In this case, the NS16e correctly classified the test image as a whale (green bar), and its next guess would have been a dolphin (white bar).
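
The decode-and-evaluate step can be sketched as follows, assuming that the predicted category is simply the one with the most output spikes and that confidence is its share of all output spikes. That decoding rule is our assumption for illustration, not a description of how PatchClassifier works internally, and the spike counts are made up.

```python
def decode(spike_counts, true_label):
    """Turn per-category output spike counts into a prediction, runner-up, and confidences."""
    total = sum(spike_counts.values())
    ranked = sorted(spike_counts, key=spike_counts.get, reverse=True)
    confidence = {label: (spike_counts[label] / total if total else 0.0) for label in ranked}
    outline = "green" if ranked[0] == true_label else "red"
    return ranked[0], ranked[1], confidence, outline


# Made-up spike counts for one test image whose true label is "whale".
counts = {"dolphin": 30, "whale": 42, "shark": 8}
prediction, runner_up, confidence, outline = decode(counts, true_label="whale")
print(prediction, runner_up, outline)  # whale dolphin green
```

The confidence values play the role of the bar lengths under each category icon, and the outline colour marks whether the top guess matches the true label.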

PatchClassifier visualizer

Figure 19. PatchClassifier visualizer configured for CIFAR-100 dataset.

February 29, 2016

A Beautiful Animation

To illustrate one of my recent patents, IBM created the following beautiful animation! Enjoy.

Mystery Box

December 15, 2015

Revealed: A Scale-Out Synaptic Supercomputer (NS1e-16)

Guest Blog by William P. Risk and Michael V. Debole with Contributions from Rodrigo Alvarez-Icaza and Filipp Akopyan.

A few months ago, we unveiled the NeuroSynaptic Evaluation (NS1e) board, which contained a single TrueNorth chip, along with circuitry for interfacing the chip to sensors and real-world data. These boards were used in our August 2015 “Boot Camp” event, in which participants learned how to program the chip to implement cognitive systems [Brain-Inspired Computing Boot Camp Begins]. During Boot Camp, each NS1e board was housed in its own plastic case, and for convenience, we built a rack to hold the 48 boards used during that event. Although the rack nicely organized and displayed the boards, a bulky assembly of power strips, Ethernet switches, and servers was also required for their use.

Recently, a government client requested that we build a system of 16 NS1e boards, with power unit, Ethernet switch, and Linux server all housed in a compact, self-contained unit, in which the NS1e boards are seamlessly integrated but mounted in such a way that any individual board can be swapped in or out easily. This requirement led us to explore designs in which individual NS1e boards are mounted on cards that can be inserted vertically into a card rack (Figure 1), with all elements mounted in a small desktop rack unit (shown below).

Figure 1. NS1e Card Rack

 

We initially considered two similar designs, both using a 6U-high desktop rack with components stacked as follows (from the bottom): 1U – power strip / network switch, 3U – NS1e card rack, 1U – NS1e card power, 1U – server. We ultimately chose the design with wiring in the back, as it provided a cleaner-looking front panel.

Figure 2. Final Design Concept

 

The next step was determining how to make the concept become a reality. For the most part, this was a straightforward process since we were able to use many off-the-shelf components (server, network switch, power strip, etc.). However, powering 16 NS1e boards required a bit of engineering to reduce the space required. As standalone boards, each is typically powered by an AC-DC adapter which simply plugs into a standard outlet, but including 16 bulky “wall warts” in a 1U form factor was impractical. In addition, we wanted to provide the capability to remotely monitor the current consumption of each individual board and to control its power state. To solve this problem we turned to a USB-style power distribution module developed by Cambrionix. While normally intended to charge and sync cell phones and tablets, its port capacity (16 USB ports) and current limits were suitable for our purposes. However, with typical USB connectors plugged into the Cambrionix board, the height required was close to 2U (3.5″), greater than the 1U we had allocated for the power distribution unit in the initial design. Fortunately, the card rack holding the NS1e boards did not occupy the full depth of the rack and we had just enough room to design a step-down enclosure using 1U of space above the NS1e drawer and dropping down to 2U in the back (see below). Finally, to give some visual appeal, we united the 16 individual NS1e boards by spreading a graphic (our award-winning visualization of the network diagram of the monkey brain) across their front panels and added some accentuating LED strip lighting on both sides of the drawer and below the chassis.

Building the system, once all the planning was complete, was relatively straightforward:

Build Evolution

Figure 3. Initial Skeleton

 

Figure 4. Early Prototype Front and Back

 

Figure 5. Functional Prototype (Alpha)

 

Figure 6. Functional Prototype (Beta)

 

Figure 7. Custom Power Enclosure

 

Figure 8. Final Lab Photo

 

Figure 9. Final Photo

 

Then crated and shipped!

 

Figure 10. Preparing to ship system to clients

 

The end result is a system that provides 16 million neurons and 4 billion synapses in a package about the size of a carry-on suitcase!

Creative Commons License
This weblog is licensed under a Creative Commons License.
The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.