Tensorflow on edge, or – Building a “smart” security camera with a Raspberry Pi

brutus1213 · on June 21, 2020

Nice writeup but the Raspberry Pi isn't running tensorflow. It is mentioned in the article that the author is sending images to an edge machine.

The big question I had was about hardware video encoding/decoding ... doesn't really cover that. I've found sending single image frames over zeromq to be fairly limiting if you care about high frame rate/low latency processing.

Key issue I have run into is while many chips support hardware video encoding/decoding, the APIs to interface with this aren't there or not in open source. Anyone who has ideas on this, I'd welcome your comment.

As an aside, another option is to run Intel's Movidus USB stick (aka Neural compute stick) and then you get a smart camera on the raspberry pi itself. That raises other issues though.

snowzach · on June 21, 2020

Shameless plug, check out DOODS: https://github.com/snowzach/doods It's a simple REST/gRPC API for doing object detection with Tensorflow or Tensorflow Lite. It will run on a Raspberry Pi. It actually did support the EdgeTPU hardware accelerator to make the Pi pretty quick for certain models. They broke something so I need to fix EdgeTPU support but it's still usable on the Pi withe the mobilenet models or inception if you're not in a hurry.

agibsonccc · on June 22, 2020

Few questions:

1. Did you build this for your own use cases? Interesting side project?

2. How do you feel about the need for base64 being a requirement on the endpoints? Isn't GRPC the wrong medium for this? Also, what do you see as the main limitations right now? The models?

snowzach · on June 22, 2020

1. I built it to integrate with Home Assistant and security systems. I was trying to use Tensorflow on a Raspberry Pi and the dependencies were a nightmare. Tensorflow in general is a nightmare to compile and run IMO. I got to thinking, what if I could make all the deps inside of a docker container. What if I could run it remotely. It was born out of that.

2. As for base64, I'm not sure of a better way to support sending raw image data over JSON (in REST mode) In some ways I think GRPC is a better medium than JSON (it supports either) as GRPC supports sending the RAW bytes. What leads you to believe GRPC isn't the right transport? Plus you can do it in a stream format if you want to do a lot of video.

The only limitations I can think of are that Tensorflow supports a myriad of CPU optimizations so providing a single container image that has all the right options is basically impossible. I created one that has what I think are some of the better options (AVX, SSE4.X) and then an image that basically should run on any 64 bit intel compatible CPU. To get optimized options you need to build the docker container yourself which can take the better part of a day on slower CPUs.

With that said, I also provide ARM32 and ARM64 containers that actually run semi-okay on Raspberry Pis and and other ARM SBCs. I can run the inception model on a Pi4 on a 1080p image in about 5 seconds which is pretty good IMO.

quietbritishjim · on June 22, 2020

> Nice writeup but the Raspberry Pi isn't running tensorflow. It is mentioned in the article that the author is sending images to an edge machine.

Yeah I was a bit surprised by this, and although the article is very clear about it I think it's generated a bit of confusion in the comments here. My understanding of edge computing is that it means the processing of data is done at the point the data is captured, so to me that would mean right there on the raspberry pi. But the author considers their whole LAN to be the "edge", so basically anything that doesn't involve sending the data over the internet:

> ... doing the heavy lifting on a machine physically close to the edge node – in this case, running the Tensorflow Object detection. By doing so, we avoid roundtrips over the internet, as well as having to pay for Cloud compute on e.g., AWS or GCP.

I think their strategy of capturing the data on a very low-power device and then processing on a server on your network is a very reasonable one, I just wouldn't have used that term.

thebruce87m · on June 21, 2020

This is where GStreamer normally steps in. A lot of hardware manufacturers provide a gstreamer plugin for their module. I’ve had experience with NVIDIA and atmel SoCs and that seemed to be the default path.

Good luck with the gstreamer pipeline learning curve however!

NikolaeVarius · on June 21, 2020

Can confirm not fun

agibsonccc · on June 22, 2020

Could you elaborate on some of the problems you had overall?

brutus1213 · on June 22, 2020

I've unsuccessfully dabbled in gstreamer in the past. I was doing a project this weekend, and the comments on this thread motivated to give it another shot .. after a couple of hours (2-4ish?), I was able to get video off the Pi to my desktop (on the same LAN) but the performance was pretty bad. I didn't optimize much yet but let me summarize the key issues I experienced with gstreamer these last few hours:

1) Very little documentation; poorly explained pipelines. I tried to read what docs I could find but things quickly devolved into trying out random gstreamer pipelines posted in comments. People don't explain why they use one particular element over another. So it felt like whack-a-mole.

2) Installing gstreamer on the Pi was a breeze. I wanted to pull video off the connected camera and sent to VLC on my desktop. Sounded like something that would work out-of-the-box? Nope. Kept seeing lots of stackoverflow comments of people stabbing in the dark, getting errors (or have the thing just sit there and not work) with very little feedback on what was wrong.

3) I have very little indication of what is hardware and what is software accelerated in my pipeline. I have no idea where latency is coming into my pipeline.

Overall .. my modern expectation for software frameworks is "batteries included" .. it is totally reasonable for sophisticated software tools to be complex .. but gstreamer is just not designed that way. While I got it to work, I see massive latency (likely because my pipeline is inefficient) and degraded quality (no idea why).

barake · on June 21, 2020

Curious what you found limiting about zeromq? Just not enough throughput for high FPS?

So far I’ve found it to be the sanest multicast solution since clients pull.

brutus1213 · on June 21, 2020

Issue isn't ZeroMQ. The simple/inefficient way to do it is to capture frames one at a time, and send them via ZeroMQ. Video is pretty bandwidth intensive .. the only reason things like YouTube work as smooth as they do is that they use codecs such H264/265 (which are proprietary unfortuantely) and stream compress frames over the network. Now doing the codec in software burns a lot of CPU as this is very math intensive .. most processors support hardware video codecs for this purpose. There are just no open source tools/libraries that make this good/simple enough that I have found.

ww520 · on June 21, 2020

Aren't IP cameras capable of H264 already encoded the video before output it to the network? H264 video stream has very good compression ratio and shouldn't consume too much bandwidth.

brutus1213 · on June 22, 2020

In the project I used zeromq, we were not using external IP cameras. My experience with IP cameras is still full of some seconds of latency .. I have no idea why.

gambiting · on June 22, 2020

Yeah I was hoping it was a raspberry pi maybe using one of these neural nets USB sticks, instead it was just using RPi as a dumb terminal for sending video. You could probably do the same with an old android phone set to stream video over lan.

theblackcat2004 · on June 21, 2020

Since you already using openCV, you can write a neat motion detection and only start sending frames for detection when motions are detected

otter-in-a-suit · on June 24, 2020

Just saw this thread (I'm the author) - great idea, thank you!

simlevesque · on June 21, 2020

Good thinking.

anp · on June 22, 2020

This reminds me of a hypothetical project I would take up if I still had a dog and a small yard: building a poop cleanup map from CV processing of camera footage.

Stepping stone toward a Poopba, obviously.

9nGQluzmnq3M · on June 22, 2020

Coral's Edge TPU products are built specifically for this kind of thing: https://coral.ai/

Hands-on video (4 min): https://www.youtube.com/watch?v=-RpNI4ZrfIM

monkeydust · on June 22, 2020

Interesting have RPi's lying around might get the USB Accelerator

thebruce87m · on June 21, 2020

Is it really edge computing if the pi isn’t running tensorflow? I know the definition is kind of woolly.

I wonder what the performance would be on a $100 jetson nano.

hn_check · on June 21, 2020

The Jetson Nano is fantastic, and I use Tensorflow on it with two home security cameras. It is a wonderful device, and deserves far more attention than it gets.

I'm going to upgrade in the next week to a Jetson Xavier NX. Not because I need to, but because I like playing around and it's a silly powerful device.

I also run a NextDNS CLI client on it, various automation stuff, etc.

brutus1213 · on June 21, 2020

Jetson nano can run Cuda code. It is pretty decent.

The article does employ a reasonable definition of edge computing IMO (scientist who works in this area). The rpi is the client, and the processing happens on a beefy edge node. But yeah .. there is not one clear, accepted definition here.

dnautics · on June 21, 2020

it's not going to the cloud, so I'd say that counts.

acidburnNSA · on June 21, 2020

This is extraordinarily neat.

Home Assistant does have a tensorflow integration [1] that allows you to run other home assistant automations (including various alerts, alarms, and scare sequences) based on person detection with basically any camera (since it's kind of a hub-and-spoke model to all other possible IoT devices).

[1] https://www.home-assistant.io/integrations/tensorflow

I struggled recently to get it running on my actual GPU since I run Home Assistant on a home server. I ended up making a custom component using pytorch instead on Pop OS 20.04 and it works gloriously. CPU usage way down and GPU has something to do now.

My super awesome self-hosted alarm system is now extra-super awesome.

Of course burglars are going to all just start wearing AI adversarial t-shirts.

joshu · on June 21, 2020

I've just started down this path using a plain RTSP-serving camera and a low end box using the Coral EdgeTPU to process the frames. It looks like there are a variety of solutions available. https://github.com/blakeblackshear/frigate https://docs.ambianic.ai/users/configure/ etc

fareesh · on June 21, 2020

I am trying to achieve something similar but at a higher scale.

I have about 48 different cameras where I want to count people and get their approximate location in the frame.

I want to run an object detection model on all of those video streams simultaneously.

My AWS instance maxes out after 7 simultaneous streams so I figured I don't really need real-time monitoring. One frame every couple of seconds, even every minute could potentially suffice, since I am dealing with larger time-frames. Since I don't want to run too many instances at the same time, what are some viable strategies to achieve this?

My plan is to have 5-6 instances of the ML model loaded up and waiting to accept a frame. When one of them is ready, it will instruct one of the RTSP streams to send it a frame, which it will process and store / send the result to an application server. I feel like I may not even be able to consume so many RTSP streams at once (I've never tried so I don't know), so I may have to have some other method of priming the handshake etc. before the model asks for a frame to process.

Is there a better / non-hacky way of achieving this (i.e. managing the workload on a single GPU instance) ?

I don't have any control of the camera hardware at all.

shiftpgdn · on June 22, 2020

48 RTSP streams is a lot of bandwidth to consume at once. Why not use an edge PC or Jetson system to do it in small blocks? A new Jetson Xavier NX can do 8-12 streams depending on FPS and model.

agibsonccc · on June 22, 2020

Hi Fareesh, I'd love to hear more about your use case. Email's in my profile.

milofeynman · on June 21, 2020

For people running Blue Iris you can do something similar to this with blue iris, deepstack[0], and an exe[1] someone wrote that sends the images to deepstack.

Video guide: https://youtu.be/fwoonl5JKgo (Links in the comments of video as well)

[0] https://deepstack.cc/

[1] https://ipcamtalk.com/threads/tool-tutorial-free-ai-person-d... https://github.com/gentlepumpkin/bi-aidetection

8fingerlouie · on June 22, 2020

I did something similar, but because i had no requirement to playback audio "real time", i opted for a simpler solution.

I run a simple video capture from a Raspberry Pi Zero W running motion, meaning all motion events are captured, including leaves blowing in the wind. The captured files are stored on a NFS share per camera.

On the server i then monitor the parent directory for every camera for new files, and run my object detection there, which in turn generates push notifications with a screengrab if certain objects are detected. It also stores a bounding box annotated version of the file. Not really needed except for figuring out why you got an alert without any clear reason.

doing it this way however allows me to save a bit on each camera, and use dedicated hardware for object detection on the server. I currently use an Intel Neural Compute Stick 2 (https://software.intel.com/content/www/us/en/develop/hardwar...), and while it is far from dedicated GPU performance, it is equally far from dedicated GPU power consumption.

josteink · on June 22, 2020

> We’ll use a Raspberry Pi 4 with the camera module to detect video. ... Now, here’s an issue for you: My old RasPi runs a 32bit version of Raspbian.

So why not just use the 64-bit Ubuntu RPi image instead then?

https://ubuntu.com/download/raspberry-pi

alkonaut · on June 21, 2020

Would object detection like this work out of the box for deer like he demonstrates for humans? I need this for deer.

agibsonccc · on June 22, 2020

Hi, could you describe your use case a bit? Just an alarm trigger for deer in the backyard?

alkonaut · on June 22, 2020

Yes, I'd point a camera at my precious vegetables and if a deer walks into the video feed, something that scares it off is triggered so it runs off before eating the whole garden.

canada_dry · on June 21, 2020

I'm hoping advances like YoloV5 [i] will allow a rpi4 to more ably do this without piping the video to another processor.

[i] https://github.com/ultralytics/yolov5

noja · on June 21, 2020

Off topic almost: does anyone know of any (long life) battery powered wifi cameras (with IR) for a project like this? Off the shelf, with a battery life of months and nice looking (like Arlo) but not cloud?

szczepano · on June 21, 2020

esp32-cam microcontroller costs around $6 and you have face detection build in. It have bluetooth and wifi and most of it's drivers code if not everything is on github. Only problem is you need to program it using arduino or other microcontroller hardware.

ed25519FUUU · on June 21, 2020

Face detection is nice but body/person detection is much more useful in these setups.

szczepano · on June 21, 2020

You mean esp-who ? It uses mobilenetv2 so it's quite possible to train it to detect person instead of face. Didn't tried myself, just started playing with it.

xchip · on June 21, 2020

the Raspberry Pi isn't running tensorflow

simlevesque · on June 21, 2020

Tensorflow on the pi itself is hard but I get great results for a similar system with just a rpi4 and opencv.

b34r · on June 22, 2020

What about package delivery people lol

hathym · on June 22, 2020

It's google edge by the way.

staycoolboy · on June 21, 2020

I've done this with a Jetson Xavier, 4 CCTV cameras and a PoE hub. You really want to use DeepStream and C/C++ for inference, not Python and TensorFlow.

I'm streaming ~20 fps (17 to 30) 720P directly from my home IP4 address, and when a person is in-frame long enough and caught by the tracker, a stream goes to an AWS endpoint for storage.

I've experimented with both SSDMobileNet and Yolo3, which are both pretty error prone but they do a much better job filtering out moving tree limbs and passing clouds, unlike Arlo.

You need way more processing power than an RPi to do this at 30fps, and C/C++, not Python. (There are literally dozens of projects for the RPi and TFlow online but they all get like 0.1 fps or less by using Flask and browser reload of a PNG... great for POC but not for real video)

I wrote very little of the code, honestly: only the capture pipe required a new C element. I started with NVidia DeepStream which is phenomenally well-written, and their built-in accelerated RTSP element, and added a custom GStreamer element that outputs a downsampled MPEG capture to the cloud when the upstream detector tracks an object. NVidia also wrote the tracker, you just need to provide an object detector like SSDMobileNet or YOLO. NVidia gets it.

The main 4 camera-pipe mux splits into the AI engine and into a tee to the RTSP server on one side and my capture element on the other side.

It was amazingly simple, and If I turn the CCD cameras down to 720P with h265 and a low bitrate, I don't need to turn on the noisy Xavier fan. The onboard Arm core does the detected downsampling (one camera only, a limitation right now) and pushes the video with a rest endpoint on a node server in the AWS cloud.

I'm very pleased with it, I haven't tested scaling but if I turned off the GPU governors I could easily go to 8 cameras. I went with PoE because WiFi can't handle the demand.

scottlamb · on June 21, 2020

> You need way more processing power than an RPi to do this at 30fps, and C/C++, not Python. (There are literally dozens of projects for the RPi and TFlow online but they all get like 0.1 fps or less by using Flask and browser reload of a PNG... great for POC but not for real video)

I think 8 streams at 15 fps (aka 120 fps total) is possible with a ($35) Raspberry Pi 4 + ($75) Coral USB Accelerator. I say "I think" because I haven't tested on this exact setup yet. My Macbook Pro and Intel NUC are a lot more pleasant to experiment on (much faster compilation times). A few notes:

* I'm currently just using the coral.ai prebuilt 300x300 MobileNet SSD v2 models. I haven't done much testing but can see it has notable false negatives and positives. It'd be wonderful to put together some shared training data [1] to use for transfer learning. I think then results could be much better. Anyone interested in starting something? I'd be happy to contribute!

* iirc, I got the Coral USB Accelerator to do about 180 fps with this model. [edit: but don't trust my memory—it could have been as low as 100 fps.] It's easy enough to run the detection at a lower frame rate than the input as well—do the H.264 decoding on every frame but only do inference at fixed pts intervals.

* You can also attach multiple Coral USB Accelerators to one system and make use of all of them.

* Decoding the 8 streams is likely possible on the Pi 4 depending on your resolution. I haven't messed with this yet, but I think it might even be possible in software, and the Pi has hardware H.264 decoding that I haven't tried to use yet.

* I use my cameras' 704x480 "sub" streams for motion detection and downsample that full image to the model's expected 300x300 input. Apparently some people do things like multiple inference against tiles of the image or running a second round of inference against a zoomed-in object detection region to improve confidence. That obviously increases the demand on both the CPU and TPU.

* The Orange Pi AI Stick Lite is crazy cheap ($20) and supposedly comparable to the Coral USB Accelerator in speed. At that price if it works buying one per camera doesn't sound too crazy. But I'm not sure if drivers/toolchain support are any good. I have a PLAI Plug (basically the same thing but sold by the manufacturer). The PyTorch-based image classification on a prebuilt model works fine. I don't have the software to build models or do object detection so it's basically useless right now. They want to charge an unknown price for the missing software, but I think Orange Pi's rebrand might include it with the device?

[1] https://groups.google.com/g/moonfire-nvr-users/c/ZD1uS7kL7tc...

serf · on June 22, 2020

>* I use my cameras' 704x480 "sub" streams for motion detection and downsample..

i've encountered cheap IPTV cameras where the main high-res stream was actually being offered with a time-shift compared to the sub-stream.

weird shit happens when you have a camera that does that, then you act on data from the sub-stream to work with data on the main stream. I played with a 'Chinesium' cctv with generic firmware that had such a bad offset that I could actually use a static offset to remediate it.

I assumed it was just a firmware bug, since the offsets didn't seem to move around as if it was a decode/encode lag or anything of that sort.

scottlamb · on June 22, 2020

Yeah, that sucks.

Did the camera send SEI Picture Timing messages? RTCP Sender Reports with NTP timestamps? Either could potentially help matters if they're trustworthy.

I haven't encountered that exact problem (large fixed offset between the streams), but I agree in general these cameras' time support is poor and synchronizing streams (either between main/sub of a single camera or across cameras) is a pain point. Here's what my software is doing today:

https://github.com/scottlamb/moonfire-nvr/blob/master/design...

Any of several changes to the camera would improve matters a lot:

* using temporal/spatial/quality SVC (Scalable Video Coding) so you can get everything you need from a single video stream

* exposing timestamps relative to the camera's uptime (CLOCK_MONOTONIC) somehow (not sure where you'd cram this into a RTSP session) along with some random boot id

* allow fetching both the main and sub video streams in a single RTSP session

* reliably slewing the clock like a "real" NTP client rather than stepping with SNTP

but I'm not exactly in a position to make suggestions that the camera manufacturers jump to implement...

staycoolboy · on June 22, 2020

I started with an Rpi by itself. Then I tried a Coral USB stick. I also tried the Intel Neural Compute Stick 2. The Coral USB accelerator doesn't accelerate all of the layers, only some of them. The CPU has to do the rest of the work. Plus, you only get this speed if you preload an image into memory and blast it through the accelerator in a loop. This ignores getting the image INTO the accelerator, which requires reshaping and shipping across USB. It fell to pieces with -one- 720P video stream. The NCS is worse.

I didn't bother with multiple $100 coral accelerators because why when I already have a Xavier?

As I said, my goal was 20-30fps with HD streams. Sure I could drop the quality, but I didn't want to, that was the point.

scottlamb · on June 22, 2020

> The Coral USB accelerator doesn't accelerate all of the layers, only some of them.

My understanding is that with the pretrained models, everything happens on the TPU. If you use some lightweight transfer learning techniques to tweak the model [1], the last layer happens on the CPU. That's supposed to be insignificant, but I haven't actually tried it.

I'm very curious what you're using for a model. You're clearly further along than I am. Did you use your own cameras' data? Did you do transfer learning? (If so, what did you start from? you mentioned SSDMobileNet and Yolo3. Do you have a favorite?) Did you build a model from scratch?

Anyway, my point is that a similar project seems doable on a Raspberry Pi 4 with some extra hardware. I don't mean to say that you're Doing It Wrong for using a Xavier. I've thought about buying one of those myself...

[1] https://coral.ai/docs/edgetpu/models-intro/#transfer-learnin...

staycoolboy · on June 22, 2020

> My understanding is that with the pretrained models, everything happens on the TPU.

Nope. Try running SSDMN on a laptop with the stick and on a pi, you will get different scores due to some layers running on the host CPU.

nl · on June 22, 2020

The Orange Pi AI Stick Lite looks really interesting.

Here's the link: https://www.aliexpress.com/item/32958159325.html and it says the PLAI training tools are (now?) free on request.

scottlamb · on June 22, 2020

Yeah, that's promising, although I don't think there's much hope of support if it doesn't work as promised. And I have doubts about the software quality. As a small example: if you follow Gyrfalcon's installation instructions for the basic Plai Builder, it sets up a udev rule that makes every SCSI device world-writeable. I realized that by accident later. And of course everything is closed-source.

Gyrfalcon's own site is actively hostile to hobbyists. They only want to deal with researchers and folks preparing to package their chips into volume products. Signing up with a suitable email address and being manually approved lets you buy the device. You then have to negotiate to buy the Model Development Kits.

Hardware-wise, their stuff looks really neat. The $20 Orange Pi AI Stick Lite has the 2801 chip at 5.6 TOPS. Gyrfalcon's version of it costs $50. The 2803 chip does 16.8 TOPS. Gyrfalcon's USB-packaged version costs $70. That'd be a fantastic deal if the software situation were satisfactory, and a future Orange Pi version might be even cheaper.

nl · on June 22, 2020

This is sadly typical, and while I understand they don't want the support burden of hobbyists I would have thought the OrangePI would ship in interesting enough numbers for there to be some kind of support.

It looks like the OrangePi 4B includes ones of these chips on board?

scottlamb · on June 23, 2020

> It looks like the OrangePi 4B includes ones of these chips on board?

Yes, it has a 2801S.

And the SolidRun Hummingboard Ripple has a 2803S. Seems a little pricy compared to a Raspberry Pi 4 + USB PLAI Plug 2803, but maybe worth it if you can actually get the software...(and I don't think they just give you one download that supports both models)

scottlamb · on June 22, 2020

> * iirc, I got the Coral USB Accelerator to do about 180 fps with this model. [edit: but don't trust my memory—it could have been as low as 100 fps.]

Just dusted off my test program. 115.5 fps on my Intel NUC. I think that's the limit of this model on the Coral USB Accelerator, or very close to it.

My Raspberry Pi 4 is still compiling...I might update with that number in a bit. Likely the H.264 decoding will be the bottleneck, as I haven't set up hardware decoding.

scottlamb · on June 22, 2020

72.2 fps on the Raspberry Pi 4 right now, with CPU varying between 150%–220%. I expect with some work I could max out the Coral USB Accelerator as the Intel NUC is likely doing already.

quietbritishjim · on June 22, 2020

> You really want to use ... C/C++ for inference, not Python ...

> You need ... C/C++, not Python.

I think this is a red herring. Usually for deep learning you just use Python to plug together the libraries that actually do the processing, and those are written in terms of C/C++. You can see that in the article where the numpy array returned from OpenCV's video capture API is passed directly to tensorflow. Python never touches the individual pixels of the image directly, and once that's inside tensorflow it's irrelevant that a Python object briefly represented it.

> with a Jetson Xavier

Well that's obviously the real difference. It's not even just the same general type of computer but a bit faster - the Jetson has a decent NVidia GPU on board whereas the Raspberry Pi is doing the processing on its extremely limp CPU. Indeed that's the whole point of the Jetson; it's basically an NVidia graphics card with extra components strapped to it to turn it into a full computer.

> You really want to use DeepStream ... not TensorFlow

I'm not familiar with DeepStream, so I'm not so sure about this, but again this is unlikely to make a great deal of difference. It's certainly not the main factor at play here: that's definitely the Jetson's GPU, which of course TensorFlow can certainly use (via CUDA and CUDNN, as does DeepStream). It's true that using TensorRT can provide a speed boost on a Jetson, but even that's possible with TensorFlow, although admittedly you have to remember to call it specifically but it's just three or four lines of (Python!) code. There are already so many ways it's unavoidable to tie yourself into NVidia's ecosystem, it seems like a bad idea to tie yourself in further in a totally avoidable way like this.

[Edit: I just realised that the image is being streamed to a remote computer that's doing the inference. The general point remains though. The totally different architecture (including having to transfer data over the network) and hardware are the actual reason for the performance difference, while C/C++ vs Python and DeepStream vs TensorFlow are tiny details.]

agibsonccc · on June 22, 2020

If something were to be more "neutral" what would you hope to see exactly? Something performant is typically going to be framework/hardware specific.

quietbritishjim · on June 22, 2020

Sorry, I'm not sure what you mean by "neutral". Are you talking about my suggestion to avoid DeepStream? If so:

The frameworks that work on multiple types of hardware, like TensorFlow and (probably most popular now) PyTorch, have separate backends for their different targets. Each of these backends have huge amounts of platform-specific code, and in the case of the Nvidia backend, that code is written in terms of CUDA just as DeepStream is. That's how they achieve good performance even though the top-level API is hardware generic. The overwhelming majority of deep learning code, both the actual learning and the inference, is written in terms of these frameworks rather than NVidia's proprietary framework. Admittedly I haven't played with NVidia's library, but I highly doubt there's a serious performance difference - it's even possible that the open-source libraries are faster due to the greater community (/Google) effort to optimise them.

It does look like DeepStream does a lot more of the processing pipeline than just the inference. In that case it's going to be a lot more tricky to get the whole pipeline on the GPU using those TensorFlow or PyTorch. At the end of the day, if only DeepStream does what you need, I'm not saying you necessarily shouldn't use it - just that you should ideally attempt to avoid it if reasonably possible.

m463 · on June 22, 2020

I think the difference with the jetson xavier is the tensor cores. The xavier is different from the pi (and even the jetson nano), like 100x different.

quietbritishjim · on June 22, 2020

The Raspberry Pi doesn't have any "tensor cores" at all. According to Wikipedia, it actually does have a "Broadcom VideoCore IV" GPU, but I don't think this processor is ever used for deep learning. So if you did inference on the Pi then it would have to be on the CPU; inference is slower even on a meaty desktop CPU than on a GPU, never mind the low-powered CPU on the Pi.

That is all academic, as the whole point of the article is actually that the processing isn't done on the Pi but on the remote server. In that case the difference (if there even is one, I don't see a frame rate mentioned in the article) is indeed down to the difference in power of the respective GPUs, as you're alluding to, or to do with the fact that the article is having to stream the image frames over the network (it doesn't even seem to compress them) whereas the parent comment's idea just processes them locally.

m463 · on June 22, 2020

You know what, you're right and I'm wrong.

I went back and looked carefullier, and I must have read "on the edge", then "testing it locally" then "integrating tensorflow" and thought they moved it. But it doesn't actually do it on the edge at all. I think I need to learn to read.

quietbritishjim · on June 22, 2020

As I said in another comment here, I and lots of other commenters misread it that way too. I definitely find it funny they took "on the edge" to mean "anywhere on my local network", rather than just on the actual device capturing the data.

reichardt · on June 21, 2020

TensorFlow Lite with SSDLite-MobileNet gets you around 4 fps on a Raspberry Pi 4 (23 fps with a Coral USB Accelerator): https://github.com/EdjeElectronics/TensorFlow-Lite-Object-De...

scottlamb · on June 22, 2020

You should be able to do a lot better than that if you're careful with the software. As I mentioned in another comment, the Coral USB Accelerator can do at least 100 fps. I haven't looked closely at that link, but likely they're doing H.264 decoding in software using one thread, then downsampling in software using one thread, then waiting for the Coral USB accelerator, and repeating. Maybe they also have the accelerator plugged into a USB 2.0 port rather than a USB 3.0 port.

The better approach is to use threading to keep all the Pi's cores busy and the USB accelerator busy at the same time, and to use hardware acceleration.

bufferoverflow · on June 22, 2020

I think I was able to run Yolo 3 on my shitty $99 smartphone a while ago. Did it for human detection. Don't remember the FPS, but it wasn't 0.1, it was much better than that.

The beauty of a smartphone is it's all in one small package, and it has everything - the CPU/GPU + camera + 4G/3G + wifi, plus you can trivially hook it up to a huge USB powerbank. They even have weatherproofed ones.

RPi will cost you more with all the bells and whistles to actually make it work for this case.

agibsonccc · on June 22, 2020

How did you set it up on the software side though? How flexible/customizable was it?

stallmanite · on June 22, 2020

I’m really interested in this, do you have anything written about how you did it?

nl · on June 22, 2020

Something seems weird here.

I agree that Python has some overheard, but the time taken should presumably be dominated by the neural network object detection. In TensorFlow that is written in (highly optimised) C, and should be using the NEON instructions on ARM[1].

Notably, DeepStream gives the same performance with the Python and C++[2].

YOLO inference speed is generally higher than a Mobilenet SSD, but you can run YOLO on TensorFlow instead of Darknet[3], or use a NNPACK version of Darknet.

Edit: "I don't need to turn on the noisy Xavier fan." - wait - this isn't on a Raspberry Pi? If you have a GPU on device then there's lots of other things going on.

[1] https://www.tensorflow.org/install/source_rpi

[2] https://developer.nvidia.com/deepstream-sdk (Scroll down for benchmarks)

[3] https://github.com/hunglc007/tensorflow-yolov4-tflite

staycoolboy · on June 22, 2020

> wait - this isn't on a Raspberry Pi?

Literally the first sentence of my post.

You and many others seem to forget that I explicitly stated I wanted 4 HD streams at 30fps from my home IP address.

> but the time taken should presumably be dominated by the neural network object detection

There is a lot more to a pipeline than just inference.

The problem is aggregating the video streams, downsampling, submitting for inference, and then activating the Gstreamer element to write the MPEG. Most of this can use the nvidia memory properties on the nv* elements, which is great! However eventually you need to copy out for the Arm core. 4 HD streams is a lot of work for a small Arm core. The basic Gstreamer elements do not use the ACL AFAIK. I did recompile them with ORC optimization, but I'm not too familiar how/if that uses ACL/NEON. And the one that I built is basically just a bus messaging system for the downstream codec pipeline that is flagged by the tracker.

Can you point me to the benchmark in your link [#2] that indicates Python & C++ have the same performance? Nvidia does have an advantage in that Tensorflow-gpu supports them natively, but that is just for inference. I only see one table and the comments explicitly state they use the DS SDK and -not- TFlow.

exhilaration · on June 21, 2020

Do you have any links you could share to build something like this?

thebruce87m · on June 21, 2020

I’d love to read a write-up of this.

staycoolboy · on June 22, 2020

Me too. :)

It is literally 90% in the NVIDIA SDK already. The demo examples provided with the kit read an HD stream, run inference and tracking -AND- provide an RTSP output! I started by replacing the HD stream with a videomux from my cameras into a composite image.

Working with the the Gstreamer RTSP server is hard, and NVIDIA basically hands it to you.

The next steps were to put tee elements on all 4 input streams to a selection demux and recompiled the tracker to send a bus event downstream to my element that controls the output of a video mux. This decides whether to send images to the final "videoconvert ! mp4mux ! payloder ! udpsink" pipeline that goes to a file... sort of.

It gets a little messy because I couldn't figure out how to start / stop gst's filesink so I send it to a UDP port instead and have another process on the machine grab payload packets and decide when to create a new file. It used to be one big MP4 file that was pushed to the cloud, and .. um, the program would crash and I would restart the process to get the next file ... i know ... currently I'm trying to chop it up based on idle time (e.g., no new frames in 300ms? start a new file!). It's ugly and I still get corrupt files sometimes, which is why I haven't written it up... and i'm lazy. I bet I could fix this if there was a manual to RTFM, but GStreamer is such a bear to work with and the only source of help is their weird mailing list archive.

haser_au · on June 22, 2020

Similar to other's comments, I'd love to read a write up about this.

dheera · on June 21, 2020

Can we all please stop using the term "edge" computing? It's nothing but a hype term and in reality it's really what we already had for the decades before the internet.

hikarudo · on June 22, 2020

I disagree. The term "edge computing" actually adds precision to a description of a distributed system. Nowadays, with a lot of machine learning inference happening on the cloud, when seeing the term "edge inference" you immediately know you don't have to send heavy bandwidth-clogging video streams to the cloud.

Inference on the edge is a clear trend in computer vision applications, now that we each year there are better low-power neural network accelerators.

dheera · on June 22, 2020

> Nowadays, with a lot of machine learning inference happening on the cloud

Right, and if it's not on the cloud, it runs locally, as everything did before "cloud" became popular. We don't need to call it "edge" just to raise VC money or put out some PR. We can just say it runs locally, on-device, etc.

If (big if) and when Adobe realizes that their Creative Cloud was a bad idea, are they going to call the next product "Adobe Edge Edition! Wow you can actually run PhotoShop on your own desktop!"?

gspr · on June 22, 2020

> Right, and if it's not on the cloud, it runs locally, as everything did before "cloud" became popular. We don't need to call it "edge" just to raise VC money or put out some PR. We can just say it runs locally, on-device, etc.

To me, "edge" means more than just "not cloud". It's appropriately used when making the point that computations happen where the data is gathered and the output is required (which seems actually not to be the case in TFA, but still). It's when computations are not offloaded elsewhere at all, not just "not to the cloud".

dheera · on June 22, 2020

> making the point that computations happen where the data is gathered and the output is required

This is how literally everything was done before the internet. It shouldn't be thought of as a new fancy concept.