Hacker News
Ask HN: How do you search large codebases before adding a feature or fixing a bug?
111 points by mohanmca on March 27, 2022 | 82 comments
Why do we need to search source-code?

  1. Quickly learn the domain and context of the application
  2. After adding a feature, we should know whether we broke anything (assume you work with code that doesn't have test cases); it even helps to search the test cases
  3. Find similar code and ensure you are improving the quality of all the similar code (not just fixing the current bug)
  4. Understand how application behaves when there are production issues.
Most often in my career I have dealt with large inherited codebases, where we often need to search for similar code or for usages of a certain variable or function/class/module. With a statically typed language, the IDE/compiler helps to a certain extent. But we have to deal with different languages, and sometimes developers copy/paste for various reasons. Searching/grepping code and its usages seems to be very useful.

As a developer, what are all the ways you search source code before working on a feature or fixing a bug? Do you use any CLI tools other than grep? I have used OpenGrok, but sometimes it is not maintained by me or other developers.

Below are my steps.

  1. Read the relevant code, and learn certain domain keywords and variable names (including class/method/function names)
  2. Use the bitbucket/GitHub/git search
  3. Use the grep
  4. Use the git-grep
Still, a few times I end up missing things.

This (especially CLI-based search) seems like a very valuable skill to have. Do you have any tips/tools for other developers?




I work on code bases with millions of lines, so I wrote a tool called Septum to help me (https://github.com/pyjarrett/septum/). This isn't to replace grep or ripgrep or silver searcher, those are all great tools you should have!

Septum is neighborhood based (context-based) search, so you can find contiguous groups of lines which contain specific things, but exclude other things. It's also interactive so you can add/remove filters as needed. This makes it useful for those cases where terms change based on their context so you can exclude terms related to the contexts you don't want to keep. It reads .septum/config which contains its normal commands to load directories and settings, so you can have different configs per project you're working on.
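To make the idea of neighborhood-based search concrete, here is a minimal Python sketch. It is not Septum's actual algorithm or API; the function name `neighborhood_search`, its parameters, and the fixed-size window around each line are all assumptions made for illustration. It keeps a window of lines only when the window contains every required term and none of the excluded ones.

```python
from typing import Iterable, List, Tuple


def neighborhood_search(
    lines: List[str],
    required: Iterable[str],
    excluded: Iterable[str],
    context: int = 2,
) -> List[Tuple[int, int]]:
    """Return (start, end) line-index ranges, end exclusive, one per matching window.

    A window of up to 2 * context + 1 lines is centered on each line. It is kept
    when it contains every required term and none of the excluded terms.
    (Hypothetical helper for illustration, not Septum's real behavior.)
    """
    required = list(required)
    excluded = list(excluded)
    hits = []
    for center in range(len(lines)):
        start = max(0, center - context)
        end = min(len(lines), center + context + 1)
        window = "\n".join(lines[start:end])
        has_all = all(term in window for term in required)
        has_excluded = any(term in window for term in excluded)
        if has_all and not has_excluded:
            hits.append((start, end))
    return hits
```

Overlapping windows are reported separately here; a real tool would presumably merge or rank them.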


Thanks for sharing! This is an awesome tool. I also like seeing it is written in Ada. I don’t often find open source tools I like to use written in Ada. Super well organized, too!


This is amazing. So many times I've needed to do something similar to the demo, and I have to hack things around. I'll definitely be using this.


This is SUUUUUUPER cool. Definitely going to try this out. Thank you for making it.

Is there an easy way to install this on non-windows systems? I'm not familiar with Ada toolchains.


This is Windows/Linux only (I didn't have a Mac at the time). You can use Alire (https://alire.ada.dev), which will install your toolchain as well as build an executable.


Looks pretty awesome! Will definitely give a try!

I'd like to hear why you chose Ada.


I normally work in C++, but I've also written in lots of other languages.

I wrote an article for these folks which includes some of the rationale here: https://blog.adacore.com/ada-crate-of-the-year-interactive-c...


Wow, great work!


Sourcetrail (Google) has recently been my primary go-to for large multi-MLOC C and C++ projects like Mozilla Firefox. I can now insert instrumentation testpoints deep inside the Firefox JavaScript JIT engine within two days from zero knowledge, and quicker from there on.

- Sourcetrail (GUI/Linux/Windows, closed-net-capable, archived) - https://github.com/CoatiSoftware/Sourcetrail

- SourceInsight (repo/web-server/closed-net) https://www.sourceinsight.com/

- OpenGrok (web server plug-in/Java/closed-net) https://oracle.github.io/opengrok/

- CLion (GUI-based IDE) - By IntelliJ/JetBrains - https://www.jetbrains.com/clion/

- SourceGraph (Web-based) https://about.sourcegraph.com (thanks, gravypod)

- Codesee.io (GitHub/web-based) - https://www.codesee.io/privacy-and-security

For free as in beer, I prefer OpenGrok so I can get more than JetBrains


There is also a company called SourceGraph which parses the AST of the various languages and lets you get CLion-like features with regex search.

https://about.sourcegraph.com/


Thanks, shimmed that link in the parent.


+1 SourceInsight, it was very helpful when working on a large legacy C/C++ codebase.


- Understand (GUI on Windows/Linux/Mac) - https://www.scitools.com/


Off the tools topic, but IMO the most important consideration: the mindset should be entirely about understanding the existing approach, conventions and philosophy, vs a critical assessment leading to "this needs to be modernized". Particularly on small-medium codebases with smaller teams, I've seen projects be fundamentally damaged by new, well-meaning devs who bypass most of the hard-slog of really understanding the existing how/why, and instead try to jump to the more comfortable space of using tooling or approaches they're more familiar with. There is certainly a place for that, but depending on the project, that might be 3 to 6 months later. Programmers need to appreciate the power and consequences for management and non-programming team members (product) when a new dev brings a condemning assessment of an existing codebase after one or two weeks.


What really undermines that necessity is this whole thing of having new developers "hit the ground running", meaning they're not just given light tasks for the first few months to help them understand the codebase and philosophy, but full blown important tasks with deadlines. I've had this happen to me more than once and I think it kind of acts to gaslight the new developer into initially thinking that the problem is them and not the codebase. Although I made it through building out a context menu UI from scratch as a first task on a previous project I was hired on to, that probably wasn't the best thing for that reason.

I get that new devs need a mindset to understand the existing approaches, but tossing them right into the middle of the fray actually makes things worse.

And yeah, teams need to be able to respect and respond to the frustrations of new developers. If a new developer, especially a senior developer, needs hours or days and multiple approvals to do something as simple as change a string, for instance, something is seriously wrong and you need to actually re-evaluate what everyone is doing.

Yes, I've experienced that exact phenomenon, and it's amazing how little people are concerned by that kind of thing. I'm not the greatest programmer in the world by any means, but I know how to change a string and when a design is ridiculously inefficient. Yes, there was a project I was on-boarded to where one of the tasks was changing how a string appeared in the UI; it turned out there was a shared library carrying the target string, but it wasn't exactly obvious that was where the string was coming from, and multiple other projects were using that library, so the review process needed to apply to all projects involved. For some reason, although a role was delegated to deploying the app itself, it was up to individual devs to build the library and register it as an NPM package, and there was a tight protocol around it. Because of the toolchain, the library was not easy to test with an app locally. So it ended up that changing a string could take several hours or even days, especially if someone wanted to squeeze another string change into that same task.

As a developer who just joined a team, what are you really supposed to do? Just plow through believing that there's a good reason for the madness?

Maybe the existing approach is actually bad.


Agreed on the hit-the-ground-running problem. Related to your final point, assuming that you're not coming into a situation well-known by all parties to be a shit show, I think there needs to be a Minimum Patience Period (probably relative to the size/complexity of the project, window given to get up to speed, etc). For example, if the product is working well, the release cadence is acceptable, the defect rate is low, etc, I think the prior devs have earned a reasonable benefit-of-the-doubt MPP that a new dev should accommodate for learning the existing approach and rationales. I've seen a senior dev completely change tooling and build process for a "fine" app at the end of their second week. All good ideas in the right time, but...


> I think there needs to be a Minimum Patience Period

100%. That can be very, very hard, but yes. It's also got to go both ways. New devs need an MPP for the project, and the rest of the team and management needs an MPP in relation to the new dev.

> if the product is working well, the release cadence is acceptable, the defect rate is low, etc, I think the prior devs have earned a reasonable benefit-of-the-doubt MPP that a new dev should accommodate for learning the existing approach and rationales.

That's often true, though release cadence and defects in production can also be low if you hire a crapton of devs to work on the same project despite any present dysfunction. So that may be something to keep in mind, too. And maybe that means the new dev realizes that they need to change their expectations such that the underpinnings of a project are conceptually flawed and that it's really their job to deal with that given that the company has plenty of revenue that they can just hire however many devs it takes to keep things running.

> I've seen a senior dev completely change tooling and build process for a "fine" app at the end of their second week. All good ideas in the right time, but...

Haha, ohhh yes, I've been on both sides of that coin. This problem really highlights how the difficult part of coding isn't so much the code but the people (and in a good way!). I've made that mistake of coming in and fiddling with, kind of wrecking, the existing process. Some of that is overambition, some of it is the lack of formal training in our field, and some of it is the ambiguity of roles that is quite common. The word "senior" can elicit this self-image that you can and should make sweeping decisions. I've appreciated it when teams have been able to communicate expectations upon joining rather than just saying "it's up to you" for everything; that is, until you actually do change something like a part of the toolchain, resulting in technical and interpersonal issues.

In my next role as a senior, I plan on going in with the mindset of not changing the foundation of a project except requesting removing the spandrels and canards that no longer serve a purpose.


Whenever working with a new codebase I always try to find the route definitions file first: something that maps the API interface to the functions it calls. I can then reason backwards from any service by clicking into whatever I’m interested in. After that I look for where config is defined and try to understand what’s unique about this env's setup.


I like this approach because it keeps the fact that code is supposed to DO SOMETHING front and center. There is so much temptation (and many excellent reasons) to organize code in various ways but I think the power of route-based thinking should not be underestimated. Obviously depends on what you're building, but for most web-apps, I would trade some code duplication for very clear "this-does-that" organization.


Whenever I work on a huge codebase (think 1M+ lines of code), I always reach for Russ Cox's codesearch https://github.com/google/codesearch. It requires indexing the codebase first, which takes 15 minutes or so, but after that searches are instant.


I reach for Hound-search[0] (originally Etsy Houndd) that uses Russ Cox's "Regular Expression Matching with a Trigram Index"

I had even made self-serve hosting for it but didn't put much effort into monetizing or otherwise promoting it.

[0] https://github.com/hound-search/hound
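As a rough illustration of why a trigram index makes searches fast, here is a minimal Python sketch. It is not the codesearch or Hound implementation: it only handles literal substrings (the real tools derive trigram queries from regular expressions), and the names `trigrams`, `build_index`, and `search_literal` are made up for this example. The index narrows the search to files that contain every trigram of the query, and only those candidates are scanned.

```python
from collections import defaultdict
from typing import Dict, List, Set


def trigrams(text: str) -> Set[str]:
    """All distinct 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(files: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each trigram to the set of file paths whose content contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for path, content in files.items():
        for gram in trigrams(content):
            index[gram].add(path)
    return index


def search_literal(
    index: Dict[str, Set[str]], files: Dict[str, str], needle: str
) -> List[str]:
    """Return sorted paths containing needle, using the index to prune candidates."""
    grams = trigrams(needle)
    if not grams:
        # Queries shorter than three characters cannot use the index: scan everything.
        candidates = set(files)
    else:
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # The index can produce false positives, so confirm with a real substring check.
    return sorted(path for path in candidates if needle in files[path])
```

In this sketch `files` is an in-memory mapping of path to content; an on-disk index, as the tools above use, is what makes the up-front indexing step worthwhile.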


Ooo yeah I think I've come across Hound before, but I don't like that the only way to interact with it is through the web UI.


I made a web site that catalogues how various companies/projects use code search:

https://codesearchguide.org/story/google

https://codesearchguide.org/story/facebook

https://codesearchguide.org/story/brave

https://codesearchguide.org/story/chromium-android

https://codesearchguide.org/story/linux

https://codesearchguide.org/story/yelp

https://codesearchguide.org/story/stripe

The Google one in particular has a great breakdown of how they use code search by use case (examples, exploration, etc.).

And here are a bunch of known code search tools: https://codesearchguide.org/tools

(Disclaimer: I am the Sourcegraph CEO and our core product is code search.)


Interesting, Amazon had a meh internal code search, much better than current GitHub tho.


Would you (or anyone else) be willing to write up a page for that site to document what Amazon's internal code search is like (from publicly available sources)?


I'm not sure much is public info. I'm not there anymore but perhaps someone who is can.

It's called GitFarm and had a web UI where you could search with a few different operators, like exact word, path, file extension. They kinda half implemented searching by class name just for Java too. Another thing, you couldn't add a NOT, or exclude certain filepaths, which was annoying especially when ppl would accidentally commit node_modules. Overall, it wasn't too bad but feels like it had kinda stagnated recently (KTLO).


I almost always fall back on ag (https://github.com/ggreer/the_silver_searcher).

Honourable mentions to cscope and ctags. They work for me since most of my $dayjob involves me mucking around with C++.

All tools get invoked from within Vim. (Which _also_ works reasonably well in Windows Terminal).


I use cscope and tags. Making sure vim is set up to navigate by tag is a must too. But given those, I have always wondered what the other methods mentioned in this news item offer that cscope/tags don't.


I keep a directory with up-to-date clones of all relevant repos. This is separate from my usual working directory. I experimented with git workspaces, but it wasn't worth the trouble, especially since the set of repos I'm working on is not necessarily the same as the ones I keep in my search dir.

At the root level I maintain two scripts:

  clone.sh

  update.sh
clone.sh has one

  git clone --recursive .. 
line per repo. When I run low on disk space I sometimes delete larger repos. The clone script allows me to easily re-clone everything in this case.

update.sh is similar but pulls all repos.

For global search across all branches I do:

  git grep <regexp> $(git rev-list --all)
(when I forget the line I look it up in Stack Overflow [1])

This is especially useful since I work a lot with Bitbucket and to the best of my knowledge you can only search the default branches there.

When I know the branch, but want to search across all history I use git pickaxe, aka

git log -S ...

All of this is not very sophisticated and takes a lot of disk space but it works pretty well for me.

[1] https://stackoverflow.com/a/15293283
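For anyone who prefers a script over two shell files, the workflow above can be sketched in Python. This is an assumption-laden sketch, not the author's scripts: the function names are hypothetical, it assumes `git` is on the PATH and that every clone lives directly under one root directory, and `--ff-only` is my own choice for the pull.

```python
import subprocess
from pathlib import Path
from typing import List


def find_repos(root: Path) -> List[Path]:
    """Return the immediate subdirectories of root that look like git clones."""
    return sorted(p for p in root.iterdir() if (p / ".git").exists())


def update_all(root: Path) -> None:
    """Rough equivalent of update.sh: pull every clone under root."""
    for repo in find_repos(root):
        # --ff-only (an assumption) avoids merge commits in a search-only clone.
        subprocess.run(["git", "-C", str(repo), "pull", "--ff-only"], check=False)


def grep_all_revisions(repo: Path, pattern: str) -> str:
    """Rough equivalent of: git grep <regexp> $(git rev-list --all)"""
    revs = subprocess.run(
        ["git", "-C", str(repo), "rev-list", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not revs:
        return ""
    # git grep exits non-zero when nothing matches, so check=True is not used here.
    result = subprocess.run(
        ["git", "-C", str(repo), "grep", "-e", pattern, *revs],
        capture_output=True, text=True,
    )
    return result.stdout
```

On repositories with a very long history, passing every revision on one command line can exceed the operating system's argument-length limit; batching the revisions would be needed in that case.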


You can combine git grep with fzf to fuzzy-search commit messages and also see a preview of the selected commit. See https://gist.github.com/junegunn/8b572b8d4b5eddd8b85e5f4d40f...


git grep is amazing. I almost never need an alternative.


same here :)


At Oracle (keep in mind every org within Oracle is very different) I wrote a crappy script to grab all repos in my org and hook them up to Etsy Hound [0]. But I couldn't stick with it long enough for it to be useful.

It's surprising to me how much effort is not being put into whole-org code search. Most projects focus solely on single-repo search. If you need to make breaking changes or find examples and you don't even know where to look, single-repo search isn't so useful.

[0] https://github.com/hound-search/hound


Thanks! That is also a concern I was looking for an answer to!


IntelliJ / PyCharm / WebStorm ctrl-shift-F: search in the whole codebase, is what I use


IntelliJ is struggling noticeably navigating even my tiny 50 KLOC search engine project. I can't even imagine using its search function to get around MLOC-scale projects.


That sounds like you have a config issue. Maybe indexing is turned off? Do you get mem low errors? https://www.jetbrains.com/help/idea/increasing-memory-heap.h...


It really has no issue as long as you have enough RAM and have disabled swap. I recently got a fairly decent gaming laptop that I'm supposed to develop on; the 16 GB of RAM is constantly full and it's swapping all the time, and the browser keeps suspending my background tabs. A real pain in the ass.


You really do not want to disable swap on Linux. The operating system does not handle out of memory situations well at all if you do.

At any rate, I've got 32 GB of ram and I'm operating off an NVMe-drive, I really don't see what the problem is.


Possibly dumb question, have you adjusted the maximum memory heap size that the JVM can allocate for running IntelliJ?


You're right, I should've mentioned this only applies to Windows pagefile rather than swap.


Hmm, just checked. The two projects I have open now have 510 KLOC and the other one 200 KLOC, and searching in both feels instant. (This is not including node_modules and python libraries/caches that it also searches through by default)


Do you have indexing turned off for your project or has indexing not completed? Indexing speeds up your searches.


It works great for me with much larger projects.


Same experience unfortunately. As much as I love JetBrains, these features don't scale with codebases like the microservice I worked on at my X 10s of billions of dollars market cap tech company, where our codebase was maybe the 10th largest.



I wrote several scripts that temporarily add call tracing to the source code. The scripts take a few days to write but I've used them on many projects. Note there are code parsing libraries that can help you.

For example:

  def foo(args):
      trace_func("foo", args)
      # rest of the function

(NOTE There is call tracing logic built into some languages but it doesn't always work for some complex code bases; try it before you write your own.)

If the code creates html elements, my script adds attributes to the html element to link back the source code location so I can look at the html and figure out where the elements were created. If the html is built using templates, then I add html comments to the template so I can tell where they are used in the final page.

Then I test the app and look at the traces to figure how it works.

At first I trace everything but once I get to know the code I add the tracing to the areas that matter. I don't check in this code.
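A minimal Python sketch of the tracing idea follows. It is a variant rather than the scripts described above: instead of rewriting source files to insert a `trace_func` call, it wraps functions with a decorator. The names `TRACE_LOG` and `traced` are hypothetical, and `trace_func` here takes keyword arguments as an extra parameter and records to an in-memory list rather than a log file.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple

# In-memory trace; a real script might append to a log file instead.
TRACE_LOG: List[Tuple[str, Tuple[Any, ...], Dict[str, Any]]] = []


def trace_func(name: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> None:
    """Record one call with its positional and keyword arguments."""
    TRACE_LOG.append((name, args, kwargs))


def traced(func: Callable[..., Any]) -> Callable[..., Any]:
    """Decorator that records every call to func before running it."""
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        trace_func(func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper


@traced
def foo(x: int, y: int = 0) -> int:
    # rest of the function
    return x + y
```

Because the decorator is easy to add and remove, it suits the "trace everything first, then narrow down" workflow; like the original scripts, it is meant to stay out of version control.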


There's no one perfect solution for the issue. A few things that I've found helpful across many years of work, codebases, languages, and frameworks:

What if any frameworks and libraries is it using? Try to identify particularly core frameworks that tend to dictate the whole workflow of the application. Many frameworks have standards of file organization and system architecture that can help you get a handle on what goes where. They may not always have been used properly, but it's a start at least. It might even help to set up a small learning project in that framework just to get to know it better. There may also be libraries in use that influence a lot of how the application does whatever it does.

Trace control flows of the application. How does it start? Do any other processes get started in addition to the main application? Learn how to do the workflow you need to modify, or the closest one to it if you're making a new one. Trace how the command to do X first gets into the application (API call? GUI button press? Some kind of messaging system trigger?), and try to follow the code to see what it does and how it does it.

Trace data flows. Where does the application store critical data, and how does that data actually get picked up from there, transformed, and eventually used, to present to the user or get transformed and handed off to some other system or whatever?

Text search of the codebase can be useful. In strongly-typed languages, often IDE tools are better at jumping straight to the code of the actual method being called though. In less typed languages, text search might be better. Or if whoever wrote the thing did a bunch of dynamic trickery, you may need to resort to running the code, in a unit test if it actually exists, or in your test environment, and attaching a debugger or adding a bunch of log statements.

It's always helpful to understand the business logic of what the application is actually trying to do, and the perspective of developers more experienced with it, if any such people are actually available.

Usually you need to do all of the above to actually develop expertise in a new codebase. Sometimes you have to not be afraid to just jump in and try doing stuff, even if it might not be the best way.


This skill is what differentiates really good developers from not so good developers (who may wrongly believe they are good):

How good you are at reading code, finding out how smaller parts work in a larger system, and understanding the context and domain is a large part of how good you are.

Not so good developers, when they are not so good at this, often start blaming the system and people who have worked on it.

Not so good developers may however be able to handle smaller systems (in particular ones written in their favorite tech stack), and this experience leads them to erroneously believe they are good.


[flagged]


I think the gist of your comment is reasonable, but it reads as kind of adversarial. Something like:

> Being able to navigate a large, complicated code base is a really important skill, and I think one of the important marks of a senior engineer. A big part of it is also just having the grit to dive in and understand a messy system without having things neatly laid out for you.

Says ~mostly the same thing, but without the condescension.


Maybe your comment is off topic? We are talking about searching. You are talking about good vs not good developers and understanding code. Searching != understanding.


You can't search without a recognizer and you can't recognize without some kind of understanding.

As OP says, one of the skills/talents that differentiates good vs not good developers is the ability to rapidly "grok" enough of a large codebase to figure out "where to tap". ( https://quoteinvestigator.com/2017/03/06/tap/ )


Knowing how to search in a large code base is a large part of this and could be included in what "good at reading code" means.


This has been a great discussion for me as someone that has been programming as a non-profession for over a decade, but is new to applying it to existing projects that are not mine through open-source.

One tool I haven't been able to find that I feel would be super helpful in the IDE is to show where code is covered in tests, like contexts when using python's `coverage`. Does anything like this exist? The benefits are two-fold: they help show me how the methods are supposed to be used, and also guide me on how and where I should test my fix or feature.


Tools I use:

- grep

- ag - Same as grep, but faster!

- find - when looking for a file by name

- helm (an epic Emacs package which does interactive search)

Used to work at a Windows shop and we used Entrian in Visual Studio. That was pretty good, but closed source and a pain to set up.


How do you get those tools to work with git branches?

We use bitbucket at work which goes out of its way to stink regarding search. But I personally find it hard to find some code in some branch of some repository that uses x.


Honestly I never really needed to search across multiple branches*. What I want to find is usually in master or in my own dev branch. Sometimes though I need to search older versions of the code and I always find `gitk` useful for that.

* Except at the windows shop where Azure Devops would provide ways to search multiple TFS branches of our product. Life is much better in companies where the master branch is the source of truth and you don't have to maintain different branches.


Especially if this is long term, this is a great tool:

https://github.com/hound-search/hound#hound

It would be great if someone integrated this with tree-sitter plus something to make the search semantics a bit smarter about usages of X:

https://www.etsy.com/codeascraft/announcing-hound-a-lightnin...

Screenshots:

https://jaxenter.com/hound-go-react-code-search-engine-15008...

Another trick I use for Java: javap all the Enums out of the compiled artifacts; these indicate weird things like "modes" that you can use to start asking questions relevant to the domain. Like "why are there four ways to reprice an invoice" or finding the "types" of fees or w/e in a billing system. (assuming enum classes are used)


I use a combination of

(1) breakpoint debugging, finding the connection between program start and various features

(2) Doxygen to generate a dependency graph

(3) create json performance profiles, manually instrumenting functions, and navigate traces using Google Chrome about://tracing or similar tools.

(4) trace and look at the data input and output, using a hex editor or over the network using wireshark
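For item (3), a JSON profile that Chrome's tracing viewer can load is simple to produce by hand. The Python sketch below is one assumed way to do the manual instrumentation: the `span` context manager and `dump_trace` helper are hypothetical names, while the event fields (`name`, `ph`, `ts`, `dur`, `pid`, `tid`) and the `traceEvents` wrapper follow the Trace Event Format, with timestamps in microseconds.

```python
import json
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List

EVENTS: List[Dict[str, Any]] = []


@contextmanager
def span(name: str, pid: int = 1, tid: int = 1) -> Iterator[None]:
    """Time the enclosed block and record it as one 'complete' trace event."""
    start = time.perf_counter()
    try:
        yield
    finally:
        end = time.perf_counter()
        EVENTS.append({
            "name": name,
            "ph": "X",  # complete event: carries both a start time and a duration
            "ts": start * 1_000_000,            # microseconds
            "dur": (end - start) * 1_000_000,   # microseconds
            "pid": pid,
            "tid": tid,
        })


def dump_trace() -> str:
    """Serialize all recorded events in the JSON object form of the trace format."""
    return json.dumps({"traceEvents": EVENTS})
```

Wrapping a suspect code path in `with span("load_config"):` and saving the output of `dump_trace()` to a file gives something that can be opened in about://tracing or a similar viewer.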


I should add a small point: you probably won't land on the right solution after your initial research. Have an expert check your prepared idea, and repeat on failure:

1. Search with advanced tools, or scripts that you wrote, to find concrete answers in the code.

2. Draw a graph of the knowledge you have and the steps, and understand how this knowledge may help resolve the issue.

3. Go to a reviewer with the plan.

4. If the expert decides your plan will not work, repeat from step 1.

5. Otherwise, implement the fix for the issue.


I ask other developers questions. Oh, they're busy? Well I don't really care because the sooner I get up to speed the less of a hassle I'll be to everyone in the long run. (EDIT: Yes, I'm being jocular with my use of hyperbole) All the documentation and grepping in the world can't make up for the intimate knowledge of those who've been on a project for a meaningful amount of time. It's surprising how people can point you to the right place in a codebase without searching, grepping, or any of that.


> Oh, they're busy? Well I don't really care

Because your time is so much more valuable than theirs!


Lol Yeah, I'm at fault for being hyperbolic on the internet.

In real life, I'm absolutely considerate of people's time. I just don't hesitate to ask questions early on and give others an "out" if they're strapped for time. It's just that I don't get it into my head that their time is so much more valuable than mine that I can't interrupt them to ask dumb questions.


I find that the more senior you become, e.g. the longer you spend on the job, the easier it is to interrupt your work to help a colleague.

Not all are that way though, which can be annoying, but I sincerely hope they can recognize that a question only comes up because it is needed to understand the subject.

Tight deadlines or production issues should take precedence though, but other than that you really should try your best to help.


Clone, ripgrep (rg). Learning how to navigate shitty code has helped me in more ways than one - one of which being I don't have to rely on extensive tooling to understand codebases.


The first thing I do is clean up the code base. I delete unused code, fix class and method visibility according to usage, do the occasional rename if naming is inconsistent, check error handling and logging for inconsistencies, write the occasional unit test... This gives me a broad overview of the code base, improves my chances of finding anything by text search, and enables me to better assess the impact of changes.


0. Try to find what & why exactly is being built

1. Try to find out which framework, architecture, design patterns used - get hold of that

2. Library dependencies

3. Database structure

4. Pick up your favourite editor (be it vim or emacs or vs code or any) in which you have mastery

5. Search for various entry points like routes, the start activity, or the main function, and try to step through the code (with debug tools open if possible)


The first thing I always do is 'read' the data model. What are the tables called? What are the relationships and cardinalities. Combining that with the source can give you a head start into being able to extract relevant conceptual information that's (strangely) rarely documented.
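When the data lives in a relational database, part of this reading can be automated. The Python sketch below assumes a SQLite database purely for illustration (other databases expose the same information through their own catalogs); the function name `describe_schema` is hypothetical. It lists every table along with its foreign keys, which gives a first picture of the relationships.

```python
import sqlite3
from typing import Dict, List, Tuple


def describe_schema(conn: sqlite3.Connection) -> Dict[str, List[Tuple[str, str, str]]]:
    """Map each table to its foreign keys as (column, referenced_table, referenced_column)."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema: Dict[str, List[Tuple[str, str, str]]] = {}
    for table in tables:
        # PRAGMA foreign_key_list rows are:
        # (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        schema[table] = [(fk[3], fk[2], fk[4]) for fk in fks]
    return schema
```

Cardinalities are not stored in the schema itself, so they still have to be inferred from unique constraints or from the data.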


I can recommend going beyond the source code and check the checkin comment or bugtracker entry of the code you intend to change (once you found it). If the code is strange or arbitrary there is often a reason, and knowing that will improve your understanding.


You can almost always ask for help from a colleague who has seen the codebase.


It is common that the developer who built it has already left the team.


Seen != Built.

You don't always have to optimize for the right answer right away. Just asking someone who's worked with/on the codebase will often give you a jump start.


If someone worked on a JIRA ticket/task, we can find them; otherwise, how do we even know who has seen the codebase?


Presumably the code base didn't just have a single author (who is also no longer around). Git blame is amazing for not only tracking down who did what, but also seeing how the code base evolved.


You ask your coworkers, "have you seen this before?"


I use git blame a lot


Look at existing tests. Existing tests usually set things up in isolation. Find the tests related to what I’m looking for, break them down, and go from there.


Concatenate all the source code files into a single file, with pathnames inserted between files.

Then use Vim to read the concatenation and (regexp) search.
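A minimal Python sketch of the concatenation step, under assumptions of my own: the function name `concatenate_sources`, the `=====` separator format, and filtering by file suffix are all illustrative choices, not the commenter's actual script.

```python
from pathlib import Path
from typing import Iterable


def concatenate_sources(root: Path, suffixes: Iterable[str], out_path: Path) -> int:
    """Write every matching file under root into out_path, each preceded by its path.

    Returns the number of files written.
    """
    wanted = tuple(suffixes)
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in wanted:
                continue
            if path.resolve() == out_path.resolve():
                continue  # never include the output file in itself
            # The separator line makes it easy to jump between files with a regex search.
            out.write(f"===== {path.relative_to(root)} =====\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
            out.write("\n")
            count += 1
    return count
```

For example, `concatenate_sources(Path("src"), [".cpp", ".h"], Path("all.txt"))` produces one file in which searching for `^=====` jumps from file to file.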


What is the benefit of this approach versus searching the codebase using tools meant for codesearch? Doesn't this fall over for medium and larger codebases? For example, my current org has over a thousand projects in a single monorepo comprising millions of lines of code in a couple of different languages.


Do whatever works for you.

I use Vim on a concatenation. It's a simple technique that allows me to search, read, annotate, and understand how everything fits together in a medium sized C++ codebase full of templates (i.e., a pile of garbage).


find & grep have been invaluable tools for me to begin grokking an unknown codebase, esp in the absence of more tailored tools. if I happen to have an IDE avail which fits a use case and has whizbang search or visualization, I'll use it

also: GraphViz is a great tool and CLI friendly


Codesee.io seems like it was made for this.



