I'm curious to hear people's thoughts on eBPF generally. It seems likely that this is where observability companies are headed: it's non-trivial to implement, but monitoring from the kernel layer makes so much sense that I expect the tooling to come along quickly.
eBPF is great, but it only works on Linux, and access to the kernel layer doesn't work for serverless environments, so it's definitely a piece of the puzzle but not a silver bullet IMO.
eBPF is indeed a part of the puzzle. It allows us to access telemetry data without any manual instrumentation when running on Linux machines.
Pixie itself is extensible and currently ingests data from many other sources as well. Joining forces with New Relic will allow us to focus on expanding the open-source project, but also to expand our capabilities by plugging into other open APIs and frameworks such as OpenTelemetry, Grafana, and Prometheus.
I caught your presentation to GoSF a couple weeks ago -- it was very impressive and I'm looking forward to the opportunity to apply lessons learned from that.
Why do you need kernel support if you modify the binaries? Why not insert a function to write your logs and then insert a call to that function, rather than relying on kernel support via an int 3?
Because serverless still needs to run on a machine, and that machine is typically at least one of 1. shared with other users, in which case giving you kernel access would be a security issue, or 2. ephemeral (firecracker VM or such) in which case eBPF is... technically possible, but not nearly as useful (you go from "this server has had X events of type Y over the last 24 hours" to "this VM had X events happen in the 590ms before it was destroyed").
I see, I thought it could be used for some simple thing, like a load balancer / proxy with a bit of logic in it, but I guess it's too constrained to do something useful as a server.
We've been eagerly waiting for some of our customers to adopt newer kernels so we can start leveraging eBPF, because of the performance gains in these types of scenarios.
Getting down to the kernel can often help find problems with disk access or network issues.
In Support Engineering we often straddle the line of 'SRE style stare at graphs and configuration as code' and 'log on to the box and look at syscalls'. We are very very excited about eBPF.
Deploy eBPF kprobes (based on bpftrace) and uprobes (based on a custom front-end language) and instantly get rich data (args, return values, latency); query the data in a Pandas-like scripting language and a visual dashboard.
Tracing from userspace into the kernel and back to userspace in one single pane of glass. Have you ever heard anyone hype DTrace on Solaris? This allows building similar things on production systems with little to no impact on running production applications.
dtrace has also been on macOS since 10.5, with a pretty nice GUI app as well. I’ve used it to trace ruby and python code to isolate slow API requests from the end user’s perspective.
Strictly speaking, one of the uses of eBPF is a DTrace-like tracing tool but it's also used in quite a few other places (for instance it's also used for devices cgroup policy in cgroupv2).
And filtering syscalls for seccomp or literally packet filtering. I believe Jens Axboe and team were looking at using eBPF for some of the low level IO subsystems.
But I was explaining the most common and obvious (for a user) use of eBPF.
They should just look at the io visor project to see some of the stuff that can be done with it (disclaimer, I work with one of the io visor maintainers)
The kernel has a "god's eye view" of everything happening on its OS. With event based tracing in the kernel there's no chance of missing an occurrence because your sampling rate is too low, for example. You can also correlate and enrich data that just isn't available in userspace.
It's extremely quick to do it there as well. Originally, BPF was created as a quicker, less impactful in-kernel packet filter for tools like tcpdump. People saw that it was a pretty neat mechanism, so they started extending it (the "e" in eBPF). I think they've finally come round and just call it BPF again now.
For hardware management, maybe, but so much of "observability" in general is at the application layer that I can't see BPF displacing anything more than a tiny corner of it.
On the other hand, it already is overhauling service meshes, VPNs, firewalls, network security policies, etc. The stuff fly.io is debuting now is probably going to be standard in a few years.
Strongly disagree. BPF uprobes allow extremely fine grained tracing of userspace applications, and allow you to programmatically correlate them with kernel level information.
Sure, but "fine-grained tracing" is itself such a small part of observability that even if BPF takes over that entire part of the stack, it's still nowhere near a complete observability story.
This is outside my area of expertise, so maybe I'm just missing some deeper insight. I can see where you're coming from for traditional "throw the artifact over the wall, good luck running it" system operations kind of stuff, or for debugging services in production, which tracing is a key part of. But again, that's just a small part of observability, and one that the rest of your dev process should be actively trying to minimize. If you have any degree of DevOps going on, many key SLIs will be much higher-level (p99 of HTTP requests, MB of storage per customer), and I don't see how eBPF addresses those better than existing instrumentation, or in some cases at all.