
I've done this with a Jetson Xavier, 4 CCTV cameras and a PoE hub. You really want to use DeepStream and C/C++ for inference, not Python and TensorFlow.

I'm streaming ~20 fps (17 to 30) at 720p directly from my home IPv4 address, and when a person is in-frame long enough to be caught by the tracker, a stream goes to an AWS endpoint for storage.

I've experimented with both SSDMobileNet and YOLOv3. Both are pretty error prone, but they do a much better job than Arlo at filtering out moving tree limbs and passing clouds.

You need way more processing power than an RPi to do this at 30fps, and C/C++, not Python. (There are literally dozens of projects for the RPi and TFlow online but they all get like 0.1 fps or less by using Flask and browser reload of a PNG... great for POC but not for real video)

I wrote very little of the code, honestly: only the capture pipe required a new C element. I started with NVidia DeepStream, which is phenomenally well written, and its built-in accelerated RTSP element, then added a custom GStreamer element that outputs a downsampled MPEG capture to the cloud when the upstream detector tracks an object. NVidia also wrote the tracker; you just need to provide an object detector like SSDMobileNet or YOLO. NVidia gets it.

The main 4-camera mux splits into the AI engine and into a tee, with the RTSP server on one side of the tee and my capture element on the other.

It was amazingly simple, and if I turn the CCD cameras down to 720p with H.265 and a low bitrate, I don't need to turn on the noisy Xavier fan. The onboard Arm core does the downsampling on detection (one camera only, a limitation right now) and pushes the video to a REST endpoint on a Node server in AWS.
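
For the curious, the topology is roughly the sketch below. It's shown as a parse-launch string through the GStreamer Python bindings just to keep it short; the real thing is a C app built from the DeepStream samples, and element/property names vary by SDK version. The camera URL, detector config, and port are placeholders.

    # Rough topology only, not my actual code. Only cam1 is shown; cam2-4 link
    # into mux.sink_1 .. mux.sink_3 the same way.
    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    pipeline = Gst.parse_launch(
        # Batch the cameras into one surface, run the detector + tracker,
        # then tee to the RTSP branch and the capture branch.
        "nvstreammux name=mux batch-size=4 width=1280 height=720 ! "
        "nvinfer config-file-path=detector.txt ! "   # SSDMobileNet or YOLO config
        "nvtracker ! "                               # NVIDIA's tracker (config elided)
        "tee name=t "
        # Branch 1: accelerated encode feeding the RTSP server (the DeepStream
        # samples serve this udpsink through a GstRtspServer).
        "t. ! queue ! nvvideoconvert ! nvdsosd ! nvv4l2h265enc ! rtph265pay ! "
        "udpsink host=127.0.0.1 port=8554 "
        # Branch 2: stand-in for the custom capture element that pushes a
        # downsampled clip to the cloud when the tracker flags an object.
        "t. ! queue ! identity name=capture-tap ! fakesink "
        # One of the four PoE cameras.
        "rtspsrc location=rtsp://cam1/stream ! rtph265depay ! h265parse ! "
        "nvv4l2decoder ! mux.sink_0"
    )
    pipeline.set_state(Gst.State.PLAYING)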

I'm very pleased with it. I haven't tested scaling, but if I turned off the GPU governors I could easily go to 8 cameras. I went with PoE because WiFi can't handle the demand.



> You need way more processing power than an RPi to do this at 30fps, and C/C++, not Python. (There are literally dozens of projects for the RPi and TFlow online but they all get like 0.1 fps or less by using Flask and browser reload of a PNG... great for POC but not for real video)

I think 8 streams at 15 fps (i.e., 120 fps total) is possible with a ($35) Raspberry Pi 4 + ($75) Coral USB Accelerator. I say "I think" because I haven't tested on this exact setup yet. My MacBook Pro and Intel NUC are a lot more pleasant to experiment on (much faster compilation times). A few notes:

* I'm currently just using the coral.ai prebuilt 300x300 MobileNet SSD v2 models. I haven't done much testing but can see it has notable false negatives and positives. It'd be wonderful to put together some shared training data [1] to use for transfer learning. I think then results could be much better. Anyone interested in starting something? I'd be happy to contribute!

* iirc, I got the Coral USB Accelerator to do about 180 fps with this model. [edit: but don't trust my memory—it could have been as low as 100 fps.] It's easy enough to run the detection at a lower frame rate than the input as well—do the H.264 decoding on every frame but only do inference at fixed pts intervals (rough sketch after this list).

* You can also attach multiple Coral USB Accelerators to one system and make use of all of them.

* Decoding the 8 streams is likely possible on the Pi 4 depending on your resolution. I haven't messed with this yet, but I think it might even be possible in software, and the Pi has hardware H.264 decoding that I haven't tried to use yet.

* I use my cameras' 704x480 "sub" streams for motion detection and downsample that full image to the model's expected 300x300 input. Apparently some people do things like multiple inference against tiles of the image or running a second round of inference against a zoomed-in object detection region to improve confidence. That obviously increases the demand on both the CPU and TPU.

* The Orange Pi AI Stick Lite is crazy cheap ($20) and supposedly comparable to the Coral USB Accelerator in speed. At that price, if it works, buying one per camera doesn't sound too crazy. But I'm not sure if the drivers/toolchain support are any good. I have a PLAI Plug (basically the same thing but sold by the manufacturer). The PyTorch-based image classification on a prebuilt model works fine. I don't have the software to build models or do object detection, so it's basically useless right now. They want to charge an unknown price for the missing software, but I think Orange Pi's rebrand might include it with the device?
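
Here's roughly what the decode-every-frame / infer-at-fixed-intervals loop from the notes above looks like with the prebuilt Edge TPU model. Untested sketch: the camera URL, the 200 ms interval, and the 0.5 score threshold are placeholders.

    # Decode every frame, but only run inference every INFER_EVERY_MS of pts.
    # Assumes the coral.ai prebuilt 300x300 MobileNet SSD v2 Edge TPU model and
    # the tflite_runtime + libedgetpu packages.
    import cv2
    import numpy as np
    from tflite_runtime.interpreter import Interpreter, load_delegate

    interpreter = Interpreter(
        model_path="ssd_mobilenet_v2_coco_quant_postprocess_edgetpu.tflite",
        experimental_delegates=[load_delegate("libedgetpu.so.1")],
    )
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()

    cap = cv2.VideoCapture("rtsp://camera/sub")   # 704x480 sub stream
    INFER_EVERY_MS = 200
    last_infer_pts = -INFER_EVERY_MS

    while True:
        ok, frame = cap.read()                    # H.264 decode happens on every frame
        if not ok:
            break
        pts = cap.get(cv2.CAP_PROP_POS_MSEC)      # frame timestamp in ms
        if pts - last_infer_pts < INFER_EVERY_MS:
            continue                              # skip inference, keep decoding
        last_infer_pts = pts
        # Downsample the full 704x480 image to the model's 300x300 input.
        small = cv2.cvtColor(cv2.resize(frame, (300, 300)), cv2.COLOR_BGR2RGB)
        interpreter.set_tensor(inp["index"], np.expand_dims(small, 0).astype(np.uint8))
        interpreter.invoke()
        boxes = interpreter.get_tensor(out[0]["index"])[0]    # normalized [ymin, xmin, ymax, xmax]
        classes = interpreter.get_tensor(out[1]["index"])[0]
        scores = interpreter.get_tensor(out[2]["index"])[0]
        for box, cls, score in zip(boxes, classes, scores):
            if score > 0.5:
                print(int(cls), float(score), box)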

[1] https://groups.google.com/g/moonfire-nvr-users/c/ZD1uS7kL7tc...


>* I use my cameras' 704x480 "sub" streams for motion detection and downsample..

I've encountered cheap IP cameras where the main high-res stream was actually being offered with a time shift compared to the sub-stream.

Weird shit happens when you have a camera that does that and you then act on data from the sub-stream to work with data on the main stream. I played with a 'Chinesium' CCTV camera with generic firmware that had such a bad offset that I could actually use a static offset to remediate it.

I assumed it was just a firmware bug, since the offsets didn't seem to move around as if it was a decode/encode lag or anything of that sort.


Yeah, that sucks.

Did the camera send SEI Picture Timing messages? RTCP Sender Reports with NTP timestamps? Either could potentially help matters if they're trustworthy.

I haven't encountered that exact problem (large fixed offset between the streams), but I agree in general these cameras' time support is poor and synchronizing streams (either between main/sub of a single camera or across cameras) is a pain point. Here's what my software is doing today:

https://github.com/scottlamb/moonfire-nvr/blob/master/design...
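
(For reference, if a camera's RTCP Sender Reports are trustworthy, turning a frame's RTP timestamp into wall-clock time is simple arithmetic. Sketch only, assuming the SR's NTP timestamp has already been converted to the Unix epoch and the standard 90 kHz video clock:)

    # Map a frame's 32-bit RTP timestamp to wall-clock time using the most
    # recent RTCP Sender Report from the same stream.
    RTP_CLOCK_RATE = 90_000   # standard RTP clock rate for video

    def frame_wall_time(frame_rtp_ts: int, sr_rtp_ts: int, sr_unix_seconds: float) -> float:
        """Return the frame's capture time in seconds since the Unix epoch."""
        delta = (frame_rtp_ts - sr_rtp_ts) & 0xFFFFFFFF
        if delta >= 1 << 31:          # frame actually precedes the Sender Report
            delta -= 1 << 32
        return sr_unix_seconds + delta / RTP_CLOCK_RATE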

Any of several changes to the camera would improve matters a lot:

* using temporal/spatial/quality SVC (Scalable Video Coding) so you can get everything you need from a single video stream

* exposing timestamps relative to the camera's uptime (CLOCK_MONOTONIC) somehow (not sure where you'd cram this into an RTSP session) along with some random boot id

* allowing both the main and sub video streams to be fetched in a single RTSP session

* reliably slewing the clock like a "real" NTP client rather than stepping with SNTP

but I'm not exactly in a position to make suggestions that the camera manufacturers jump to implement...


I started with an RPi by itself. Then I tried a Coral USB stick. I also tried the Intel Neural Compute Stick 2. The Coral USB accelerator doesn't accelerate all of the layers, only some of them. The CPU has to do the rest of the work. Plus, you only get this speed if you preload an image into memory and blast it through the accelerator in a loop. This ignores getting the image INTO the accelerator, which requires reshaping and shipping across USB. It fell to pieces with -one- 720p video stream. The NCS is worse.

I didn't bother with multiple $100 Coral accelerators because why bother when I already have a Xavier?

As I said, my goal was 20-30 fps with HD streams. Sure, I could drop the quality, but I didn't want to; that was the point.


> The Coral USB accelerator doesn't accelerate all of the layers, only some of them.

My understanding is that with the pretrained models, everything happens on the TPU. If you use some lightweight transfer learning techniques to tweak the model [1], the last layer happens on the CPU. That's supposed to be insignificant, but I haven't actually tried it.

I'm very curious what you're using for a model. You're clearly further along than I am. Did you use your own cameras' data? Did you do transfer learning? (If so, what did you start from? You mentioned SSDMobileNet and YOLOv3. Do you have a favorite?) Did you build a model from scratch?

Anyway, my point is that a similar project seems doable on a Raspberry Pi 4 with some extra hardware. I don't mean to say that you're Doing It Wrong for using a Xavier. I've thought about buying one of those myself...

[1] https://coral.ai/docs/edgetpu/models-intro/#transfer-learnin...


> My understanding is that with the pretrained models, everything happens on the TPU.

Nope. Try running SSDMN on a laptop with the stick and then on a Pi; you will get different scores due to some layers running on the host CPU.


The Orange Pi AI Stick Lite looks really interesting.

Here's the link: https://www.aliexpress.com/item/32958159325.html and it says the PLAI training tools are (now?) free on request.


Yeah, that's promising, although I don't think there's much hope of support if it doesn't work as promised. And I have doubts about the software quality. As a small example: if you follow Gyrfalcon's installation instructions for the basic Plai Builder, it sets up a udev rule that makes every SCSI device world-writeable. I realized that by accident later. And of course everything is closed-source.

Gyrfalcon's own site is actively hostile to hobbyists. They only want to deal with researchers and folks preparing to package their chips into volume products. Signing up with a suitable email address and being manually approved lets you buy the device. You then have to negotiate to buy the Model Development Kits.

Hardware-wise, their stuff looks really neat. The $20 Orange Pi AI Stick Lite has the 2801 chip at 5.6 TOPS. Gyrfalcon's version of it costs $50. The 2803 chip does 16.8 TOPS. Gyrfalcon's USB-packaged version costs $70. That'd be a fantastic deal if the software situation were satisfactory, and a future Orange Pi version might be even cheaper.


This is sadly typical, and while I understand they don't want the support burden of hobbyists, I would have thought the Orange Pi would ship in interesting enough numbers for there to be some kind of support.

It looks like the OrangePi 4B includes one of these chips on board?


> It looks like the OrangePi 4B includes one of these chips on board?

Yes, it has a 2801S.

And the SolidRun Hummingboard Ripple has a 2803S. Seems a little pricey compared to a Raspberry Pi 4 + USB PLAI Plug 2803, but maybe worth it if you can actually get the software... (and I don't think they just give you one download that supports both models)


> * iirc, I got the Coral USB Accelerator to do about 180 fps with this model. [edit: but don't trust my memory—it could have been as low as 100 fps.]

Just dusted off my test program. 115.5 fps on my Intel NUC. I think that's the limit of this model on the Coral USB Accelerator, or very close to it.

My Raspberry Pi 4 is still compiling...I might update with that number in a bit. Likely the H.264 decoding will be the bottleneck, as I haven't set up hardware decoding.


72.2 fps on the Raspberry Pi 4 right now, with CPU usage varying between 150% and 220%. I expect with some work I could max out the Coral USB Accelerator, as the Intel NUC is likely already doing.


> You really want to use ... C/C++ for inference, not Python ...

> You need ... C/C++, not Python.

I think this is a red herring. Usually for deep learning you just use Python to plug together the libraries that actually do the processing, and those are written in terms of C/C++. You can see that in the article where the numpy array returned from OpenCV's video capture API is passed directly to tensorflow. Python never touches the individual pixels of the image directly, and once that's inside tensorflow it's irrelevant that a Python object briefly represented it.
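
A minimal sketch of that pattern, assuming a TF2 detection-zoo SavedModel (the model path is a placeholder, not the article's code):

    # Python only shuffles references here; the pixel work happens inside
    # OpenCV's and TensorFlow's C/C++ (and CUDA) kernels.
    import cv2
    import numpy as np
    import tensorflow as tf

    detector = tf.saved_model.load("ssd_mobilenet_v2/saved_model")  # placeholder path

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()                # numpy array produced by OpenCV's C++ code
    if ok:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        batch = tf.convert_to_tensor(rgb[np.newaxis, ...], dtype=tf.uint8)
        detections = detector(batch)      # inference runs entirely in native code
        print(detections["detection_scores"][0][:5].numpy())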

> with a Jetson Xavier

Well, that's obviously the real difference. The Jetson isn't just the same general type of computer but a bit faster: it has a decent NVidia GPU on board, whereas the Raspberry Pi is doing the processing on its extremely limp CPU. Indeed, that's the whole point of the Jetson; it's basically an NVidia graphics card with extra components strapped to it to turn it into a full computer.

> You really want to use DeepStream ... not TensorFlow

I'm not familiar with DeepStream, so I'm not so sure about this, but again this is unlikely to make a great deal of difference. It's certainly not the main factor at play here: that's definitely the Jetson's GPU, which TensorFlow can of course use (via CUDA and cuDNN, as does DeepStream). It's true that using TensorRT can provide a speed boost on a Jetson, but even that's possible with TensorFlow; admittedly you have to remember to call it specifically, but it's just three or four lines of (Python!) code. There are already so many ways it's unavoidable to tie yourself into NVidia's ecosystem; it seems like a bad idea to tie yourself in further in a totally avoidable way like this.
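
For reference, the TF-TRT conversion I mean is roughly this (TensorFlow 2.x API; paths are placeholders):

    # Convert a SavedModel so its supported subgraphs run through TensorRT.
    from tensorflow.python.compiler.tensorrt import trt_convert as trt

    converter = trt.TrtGraphConverterV2(input_saved_model_dir="ssd_mobilenet_v2/saved_model")
    converter.convert()                                   # swap supported ops for TRT engines
    converter.save("ssd_mobilenet_v2_trt/saved_model")    # load with tf.saved_model.load as usual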

[Edit: I just realised that the image is being streamed to a remote computer that's doing the inference. The general point remains though. The totally different architecture (including having to transfer data over the network) and hardware are the actual reason for the performance difference, while C/C++ vs Python and DeepStream vs TensorFlow are tiny details.]


If something were to be more "neutral", what would you hope to see exactly? Something performant is typically going to be framework/hardware specific.


Sorry, I'm not sure what you mean by "neutral". Are you talking about my suggestion to avoid DeepStream? If so:

The frameworks that work on multiple types of hardware, like TensorFlow and (probably most popular now) PyTorch, have separate backends for their different targets. Each of these backends has huge amounts of platform-specific code, and in the case of the Nvidia backend, that code is written in terms of CUDA just as DeepStream is. That's how they achieve good performance even though the top-level API is hardware-generic. The overwhelming majority of deep learning code, both the actual learning and the inference, is written in terms of these frameworks rather than NVidia's proprietary framework. Admittedly I haven't played with NVidia's library, but I highly doubt there's a serious performance difference - it's even possible that the open-source libraries are faster due to the greater community (/Google) effort to optimise them.

It does look like DeepStream covers a lot more of the processing pipeline than just the inference. In that case it's going to be a lot trickier to get the whole pipeline onto the GPU using TensorFlow or PyTorch. At the end of the day, if only DeepStream does what you need, I'm not saying you necessarily shouldn't use it - just that you should ideally avoid it if reasonably possible.


I think the difference with the Jetson Xavier is the tensor cores. The Xavier is different from the Pi (and even the Jetson Nano), like 100x different.


The Raspberry Pi doesn't have any "tensor cores" at all. According to Wikipedia, it actually does have a "Broadcom VideoCore IV" GPU, but I don't think this processor is ever used for deep learning. So if you did inference on the Pi then it would have to be on the CPU; inference is slower even on a meaty desktop CPU than on a GPU, never mind the low-powered CPU on the Pi.

That is all academic, as the whole point of the article is actually that the processing isn't done on the Pi but on the remote server. In that case the difference (if there even is one; I don't see a frame rate mentioned in the article) is indeed down to the difference in power of the respective GPUs, as you're alluding to, or to the fact that the article's setup has to stream the image frames over the network (it doesn't even seem to compress them), whereas the parent comment's setup processes them locally.


You know what, you're right and I'm wrong.

I went back and looked more carefully, and I must have read "on the edge", then "testing it locally", then "integrating tensorflow" and thought they moved it. But it doesn't actually do it on the edge at all. I think I need to learn to read.


As I said in another comment here, lots of other commenters and I misread it that way too. I definitely find it funny they took "on the edge" to mean "anywhere on my local network" rather than just the actual device capturing the data.


TensorFlow Lite with SSDLite-MobileNet gets you around 4 fps on a Raspberry Pi 4 (23 fps with a Coral USB Accelerator): https://github.com/EdjeElectronics/TensorFlow-Lite-Object-De...


You should be able to do a lot better than that if you're careful with the software. As I mentioned in another comment, the Coral USB Accelerator can do at least 100 fps. I haven't looked closely at that link, but likely they're doing H.264 decoding in software using one thread, then downsampling in software using one thread, then waiting for the Coral USB accelerator, and repeating. Maybe they also have the accelerator plugged into a USB 2.0 port rather than a USB 3.0 port.

The better approach is to use threading to keep all the Pi's cores busy and the USB accelerator busy at the same time, and to use hardware acceleration.
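
Something shaped like this, I mean (sketch only; the queue size, camera URL, and run_inference() stub are placeholders):

    # One thread decodes; the main thread keeps the accelerator busy. A small
    # bounded queue sits in between so decode never runs far ahead.
    import queue
    import threading
    import cv2

    frames = queue.Queue(maxsize=8)

    def run_inference(frame) -> None:
        """Placeholder: hand the frame to the Coral USB Accelerator here."""

    def decode(url: str) -> None:
        cap = cv2.VideoCapture(url)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            try:
                frames.put(frame, timeout=1)
            except queue.Full:
                pass               # drop frames rather than fall behind the camera

    threading.Thread(target=decode, args=("rtsp://camera/sub",), daemon=True).start()
    while True:
        run_inference(frames.get())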


I think I was able to run YOLOv3 on my shitty $99 smartphone a while ago. Did it for human detection. Don't remember the FPS, but it wasn't 0.1; it was much better than that.

The beauty of a smartphone is it's all in one small package, and it has everything: CPU/GPU, camera, 4G/3G, and WiFi. Plus you can trivially hook it up to a huge USB power bank. They even have weatherproofed ones.

An RPi will cost you more once you add all the bells and whistles needed to actually make it work for this case.


How did you set it up on the software side though? How flexible/customizable was it?


I’m really interested in this, do you have anything written about how you did it?


Something seems weird here.

I agree that Python has some overhead, but the time taken should presumably be dominated by the neural network object detection. In TensorFlow that is written in (highly optimised) C, and should be using the NEON instructions on ARM[1].

Notably, DeepStream gives the same performance with Python and C++[2].

YOLO inference is generally heavier than a MobileNet SSD's, but you can run YOLO on TensorFlow instead of Darknet[3], or use an NNPACK version of Darknet.

Edit: "I don't need to turn on the noisy Xavier fan." - wait - this isn't on a Raspberry Pi? If you have a GPU on device then there's lots of other things going on.

[1] https://www.tensorflow.org/install/source_rpi

[2] https://developer.nvidia.com/deepstream-sdk (Scroll down for benchmarks)

[3] https://github.com/hunglc007/tensorflow-yolov4-tflite


> wait - this isn't on a Raspberry Pi?

Literally the first sentence of my post.

You and many others seem to forget that I explicitly stated I wanted 4 HD streams at 30fps from my home IP address.

> but the time taken should presumably be dominated by the neural network object detection

There is a lot more to a pipeline than just inference.

The problem is aggregating the video streams, downsampling, submitting for inference, and then activating the GStreamer element to write the MPEG. Most of this can use the NVIDIA memory properties on the nv* elements, which is great! However, eventually you need to copy out for the Arm core, and 4 HD streams is a lot of work for a small Arm core. The basic GStreamer elements do not use the ACL AFAIK. I did recompile them with ORC optimization, but I'm not too familiar with how/if that uses ACL/NEON. And the one element I built is basically just a bus messaging system for the downstream codec pipeline that is flagged by the tracker.

Can you point me to the benchmark in your link [#2] that indicates Python & C++ have the same performance? Nvidia does have an advantage in that Tensorflow-gpu supports them natively, but that is just for inference. I only see one table and the comments explicitly state they use the DS SDK and -not- TFlow.


Do you have any links you could share to build something like this?


I’d love to read a write-up of this.


Me too. :)

It is literally 90% in the NVIDIA SDK already. The demo examples provided with the kit read an HD stream, run inference and tracking -AND- provide an RTSP output! I started by replacing the HD stream with a videomux from my cameras into a composite image.

Working with the GStreamer RTSP server is hard, and NVIDIA basically hands it to you.

The next steps were to put tee elements on all 4 input streams feeding a selection demux, and to recompile the tracker to send a bus event downstream to my element that controls the output of a video mux. This decides whether to send images to the final "videoconvert ! mp4mux ! payloader ! udpsink" pipeline that goes to a file... sort of.

It gets a little messy because I couldn't figure out how to start/stop GStreamer's filesink, so I send it to a UDP port instead and have another process on the machine grab payload packets and decide when to create a new file. It used to be one big MP4 file that was pushed to the cloud, and... um, the program would crash and I would restart the process to get the next file... I know... Currently I'm trying to chop it up based on idle time (e.g., no new frames in 300ms? start a new file!). It's ugly and I still get corrupt files sometimes, which is why I haven't written it up... and I'm lazy. I bet I could fix this if there were a manual to RTFM, but GStreamer is such a bear to work with and the only source of help is their weird mailing list archive.
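
The splitter side is roughly this shape (port, timeout, and filenames are made up, and this only shows the idle-cut logic; the real process still has to depacketize/remux what comes off the wire):

    # Read payload packets from the UDP port the pipeline writes to, and start
    # a new file whenever the stream goes idle for ~300 ms.
    import socket
    import time

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", 5600))
    sock.settimeout(0.3)              # no packets for 300 ms => close the current clip

    out = None
    while True:
        try:
            packet, _ = sock.recvfrom(65536)
        except socket.timeout:
            if out is not None:       # idle: finish the current clip
                out.close()
                out = None
            continue
        if out is None:               # first packet of a new clip
            out = open("clip-%d.mp4" % int(time.time()), "wb")
        out.write(packet)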


Similar to others' comments, I'd love to read a write-up about this.



