Hacker News

But people use grep on their code all the time ...


People use grep on their local code repository, which is generally less than 2 gigabytes of source. A tool like ripgrep can process that in under a second on any modern machine with a warm disk cache.

It's when you get to hundreds of repositories or tens of gigabytes of code that local tools can't keep up. They aren't designed for that use case, and they usually depend on the files being searched sitting in the disk cache for repeated searches to be fast.

It may be possible for GitHub to shell out to grep for a single-repository search (I have no idea how the back-end works, but I doubt it's impossible). However, I suspect that almost everyone wants and expects this to work across multiple repositories, or across all of GitHub's repositories.

Since it's not easily possible to do that across everything, they don't add it even for a single repository, to avoid search working differently in different situations, which is a fair approach in my opinion.


Nobody is asking for cross-repo search. Literally just let us run "git grep" on GitHub.


On your local machine, you can get away with a linear scan (as "git grep" does), because nothing else is contending for time with your search. "git grep" is actually very costly in terms of CPU and disk IO, but you don't notice, because you're only running one "git grep" at a time, so nothing is contending for those resources.

On GitHub, tens of thousands of searches will be happening at once on the same search cluster. If they were all literally doing a "git grep" (a linear scan of the associated repo data), the disk caches would thrash back and forth between queries, and nothing could be answered in less than 30 seconds.

The only way for GitHub to respond to code searches at scale in a reasonable time, is to have a pre-built index.

If the index were per-repo, that would be a kind of partitioned index, and no DBMS I know of can handle a partitioned object with 58 million partitions. It makes much more sense to have an unpartitioned index... which effectively implies "cross-repo search" (because once you have an unpartitioned index built over all your repos, it costs nothing to let someone search that entire index at once).
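A sketch of what such an unpartitioned index could look like, in Python: one global posting list per trigram, with the repo recorded in each posting, so "search just this one repo" is nothing more than a filter over the same shared index. All names here are illustrative, and this is not how GitHub's back-end actually works; it only shows why per-repo scoping comes for free once the index is global.

```python
from collections import defaultdict

def trigrams(text):
    """All 3-character substrings of the text."""
    return {text[i:i+3] for i in range(len(text) - 2)}

class CodeIndex:
    """One global index; each posting records (repo, path),
    so a per-repo search is just a filter over the same index."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> {(repo, path)}
        self.files = {}                   # (repo, path) -> contents

    def add(self, repo, path, contents):
        self.files[(repo, path)] = contents
        for t in trigrams(contents):
            self.postings[t].add((repo, path))

    def search(self, literal, repo=None):
        """Find files containing `literal`; restrict to one repo if given."""
        grams = trigrams(literal)
        if not grams:
            return []
        # A match must contain every trigram of the query.
        candidates = set.intersection(*(self.postings[t] for t in grams))
        if repo is not None:
            candidates = {c for c in candidates if c[0] == repo}
        # Trigram hits can be false positives (right trigrams, wrong order),
        # so verify each candidate with a real substring check.
        return sorted(c for c in candidates if literal in self.files[c])
```

The `repo=None` default is the whole point: the cross-repo query and the single-repo query hit the exact same data structure, which is the "costs nothing" claim above.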


While that's basically true, git grep doesn't scale...

With ever-cheaper RAM, ever-faster SSDs, and cloud computing, it could actually make sense, now or soon, to scan through an entire repo, either on "disk" or in RAM. I started building an app that would pull down any GitHub repo (by simulating a git clone on the back-end) in the time it took you to type your query, then hold it in RAM across queries. As a user, I'm sure I would gladly pay the few cents it costs to hold that much RAM while I'm on the search page over the course of a month.

Even if a linear scan isn't feasible, regex search at scale is not an unsolved problem, and GitHub has access to world-class engineers. Google Code used trigrams to do regex search with an index (https://swtch.com/~rsc/regexp/regexp4.html). Sourcegraph offers regex search.
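The trigram approach from the linked article can be sketched in a few lines: use trigrams of a literal substring that every match must contain to cut the candidate set down, then run the real regex only on those candidates. This is a toy version with a hand-picked literal hint and made-up document contents; Cox's real implementation derives the required trigram query from the regex automatically, which is the hard part.

```python
import re
from collections import defaultdict

# Toy corpus (made-up contents); a real deployment indexes file blobs.
docs = {
    0: "func parseConfig(path string) error {",
    1: "def parse_config(path): ...",
    2: "// nothing relevant here",
}

def trigrams(text):
    """All 3-character substrings of the text."""
    return {text[i:i+3] for i in range(len(text) - 2)}

# Built once at index time: trigram -> set of doc ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for t in trigrams(text):
        index[t].add(doc_id)

def regex_search(literal_hint, pattern):
    """Shrink the candidate set using trigrams of a literal substring
    that every match must contain, then regex-scan only the candidates."""
    grams = trigrams(literal_hint)
    candidates = set.intersection(*(index[t] for t in grams)) if grams else set(docs)
    return sorted(d for d in candidates if re.search(pattern, docs[d]))
```

For a pattern like `parse[_A-Z]?[Cc]onfig`, the substring "onfig" appears in every possible match, so its trigrams prune the corpus before any regex engine runs; that pruning is what makes regex search over millions of files tractable.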


When I press "download" on GitHub, it also does a linear scan ...


Of the files, maybe, but not the characters in those files. It's orders of magnitude more difficult.


It would only really make sense to have one big index, and to narrow the search to a particular project within that index. This of course has the benefit that you can broaden out to the whole index if you want.

You could then offer code search across the whole index (minus any project marked private that you are not a part of) as a paid offering.



