It is a real cultural problem how engineers get more excited about machine learning than basic usability.
GitHub search can't even search for a literal string, let alone a regex. It can't search a subdirectory. Ranking is indistinguishable from random. It's been this way for years. How about building an actual, usable, basic code search and then getting all fancy with your machine learning?
I almost built my own "online git grep for GitHub" last year.
I agree with this sentiment 100%. I can use traditional search engines for "how to ping a rest thing in python", but I can't grep GitHub for even basic snippets of code. I don't think their global code search has ever been useful. Glad they have their priorities straight /s
At first I thought this would replace Sourcegraph, but looks like it's just an experiment with NLP... Thank goodness we have Sourcegraph for searching GH but especially for searching GHEnterprise in an SOA environment where it's impossible to have every repo cloned locally for ripgrep.
P.S. I'm not affiliated, we just use Sourcegraph at a company I work for.
Currently it runs on a fairly slow machine, so regex-heavy requests will take some time on big package repositories like Rubygems, but I plan to get a nicer machine soon.
If you know Scala, you can even contribute (wink wink), just ping me. A lot of tasks we have at this stage are pretty basic.
This limitation is especially frustrating when a fork becomes the "primary" repo for a project for some reason. It's probably not a common occurrence overall, but I've run into it at least a couple of times.
A good example is that GitHub's own repo for their CommonMark implementation isn't searchable, because it's a fork of cmark: https://github.com/github/cmark/
Exactly this. I don’t understand why the-thing-I-searched-for.java is so rarely on the first page of results. Doesn’t that seem like an obvious thing I might be interested in?!?
Yes! When I read the post title I was really excited. Then I clicked in and felt my heart sink a little. Engineers and PMs seem to be too easily swayed by shiny things.
To be fair, it's a hard problem to solve, especially with traditional search engine tools.
Take, for example:
for(int i=0;i<100;i++)
and then a search for i++. Due to the way almost every search tool works, that line would be split into the tokens "for int i 0 100", which are not very useful. Even if you include the characters = ; < + ( ) in the index, you break the ability to do things such as boolean queries or fuzzy search (term~1).
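To make the tokenization problem concrete, here is a toy Python sketch of what a standard alphanumeric analyzer does to that line (the regex stands in for a real analyzer; it is an illustration, not any particular engine's code):

```python
import re

code = "for(int i=0;i<100;i++)"

# A typical search-engine analyzer keeps only alphanumeric runs,
# so operators like ++ and < vanish from the index entirely.
standard_tokens = re.findall(r"[A-Za-z0-9]+", code)
print(standard_tokens)  # ['for', 'int', 'i', '0', 'i', '100', 'i']

# A query for "i++" therefore has nothing in the index to match against:
print("i++" in standard_tokens)  # False
```

Keeping the punctuation as tokens would make "i++" findable, but then characters like ~ and ( can no longer double as query-syntax operators, which is the trade-off described above.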
It's totally possible to solve these issues by tweaking the input to your index, which is what I did with searchcode.com, or with a different approach, which is what Google Code Search did. However, neither has a requirement to be 100% in sync with the repository, which I suspect is something the GitHub team values.
All code search tools suffer from this in some way. At small scale it's possible to just brute-force the search. At scale you can do it by tweaking your algorithm and sacrificing accuracy. My feeling is that the GitHub team chose accuracy.
People use grep on their local code repository, which is generally less than 2 gigabytes of source. A tool like ripgrep can process that in under a second on any modern machine with a warm disk cache.
It's when you get to hundreds of repositories or tens of gigabytes of code that local tools cannot run fast enough. They are not designed for this use case, and usually rely on the files being searched hitting the disk cache for repeated-search performance.
It may be possible for GitHub to shell out to grep for a single-repository search (I have no idea how the back end works, but I doubt it's impossible), but I suspect that almost everyone wants/expects this to work across multiple repositories or across all of GitHub's repositories.
Since it's not easily possible to do so across everything, they are not adding it even for a single repository, to avoid search working differently in different situations, which is a fair approach in my opinion.
On your local machine, you can get away with a linear scan (as "git grep" does), because nothing else is contending for time with your search. "git grep" is actually very costly in terms of CPU and disk IO, but you don't notice, because you're only running one "git grep" at once, and so nothing is contending for those resources.
On GitHub, tens of thousands of searches will be happening at once on the same search cluster. If they were all literally doing a "git grep" (a linear scan of the associated repo data), the disk caches would thrash back and forth between queries, and nothing could be answered in less than 30 seconds.
The only way for GitHub to respond to code searches at scale in a reasonable time, is to have a pre-built index.
If the index was per repo, that'd be a kind of partitioned index; and there's no DBMS that I know of that can handle a partitioned object having 58 million partitions. It makes much more sense to have an unpartitioned index... which effectively implies "cross-repo search" (because, once you have an unpartitioned index built up over all your repos, it costs nothing to enable someone to search that entire index at once.)
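To make that concrete, here is a minimal sketch (my own toy Python, not anything GitHub actually runs) of a single unpartitioned index where the repo is just a field on each posting, so per-repo search is a filter and cross-repo search is the same lookup with no filter:

```python
from collections import defaultdict

# One global index over every repo: term -> list of (repo, path) postings.
index = defaultdict(list)

def add_document(repo, path, text):
    """Index each distinct whitespace token of a file under (repo, path)."""
    for term in set(text.split()):
        index[term].append((repo, path))

def search(term, repo=None):
    """Look up a term; optionally scope the hits to a single repo."""
    postings = index.get(term, [])
    return [p for p in postings if repo is None or p[0] == repo]

add_document("alice/app", "main.py", "import requests")
add_document("bob/lib", "http.py", "import requests from urllib")

print(search("requests"))                  # hits across all repos
print(search("requests", repo="bob/lib"))  # same index, scoped to one repo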
While that's basically true, git grep doesn't scale...
With ever-cheaper RAM, ever-faster SSDs, and cloud computing, it could actually make sense, now or soon, to scan through an entire repo, either on "disk" or in RAM. I started building an app that would pull down any GitHub repo on the back end (by simulating a git clone) in the time it took you to type your query, then hold it in RAM across queries. As a user, I'm sure I would gladly pay however many cents it costs to occupy that much RAM while I'm on the search page over the course of a month.
Even if a linear scan isn't feasible, regex search at scale is not an unsolved problem, and GitHub has access to world-class engineers. Google Code used trigrams to do regex search with an index (https://swtch.com/~rsc/regexp/regexp4.html). Sourcegraph offers regex search.
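For the curious, the trigram trick from that writeup can be sketched in a few lines of Python (the file contents, ids, and function names here are made up for illustration; a real engine extracts required trigrams from the regex itself rather than from a literal):

```python
import re
from collections import defaultdict

def trigrams(text):
    """All overlapping 3-character substrings of a string."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Index: trigram -> set of file ids containing it.
files = {
    1: "for(int i=0;i<100;i++)",
    2: "while(true){sleep(1);}",
}
index = defaultdict(set)
for fid, content in files.items():
    for t in trigrams(content):
        index[t].add(fid)

def search_literal(pattern):
    # A file can only contain the pattern if it contains every one of the
    # pattern's trigrams, so intersect the posting sets to get candidates,
    # then run the (escaped) regex only on those few files.
    candidates = set(files)
    for t in trigrams(pattern):
        candidates &= index.get(t, set())
    return [fid for fid in candidates if re.search(re.escape(pattern), files[fid])]

print(search_literal("i<100"))  # [1]
print(search_literal("sleep"))  # [2]
```

The index prunes the search down to a handful of candidate files, and the expensive regex scan runs only on those, which is how regex search stays tractable at scale without a full linear scan.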
It would only really make sense to have one big index, and to narrow the search to a particular project within that index. This of course has the benefit that you can broaden out to the whole index if you want.
You could then offer code search across the whole index (minus any project marked private that you are not a part of) as a paid offering.
Could not agree more. Everyone who works for, or has worked for, Google in recent years knows that an excellent code search does not have to be fancy.
I made a regex search for GitHub and an Emacs plugin for it. In theory I could put this on GitHub. It uses the BigQuery GHTorrent table. There's only so much time in a day, though. If you want it, upvote me.
So what? Their only mistake here was not licensing or buying an existing search engine instead of spending years developing another "meh" one. What they are doing with semantic search is the future, and their chance to make all existing code search engines obsolete. Use your favorite Internet search engine to find snippets of code on GitHub instead; it won't give you semantic code search, though.