Josh: Just One Single History

skybrian · on Feb 21, 2024

To sum up, it looks like this is a server that provides smaller views of a large monorepo. A simple case is checking out a subdirectory of the repo and just getting the files and history for that subdirectory.

They also advocate using these views to do builds. That way, your build can't depend on any files that are outside its view. It's a kind of sandboxing for builds. [1]

> In particular when using C family languages, hidden dependencies on header files are easy to miss. For this reason limiting the visibility of files to the compiler by sandboxing is pretty much a requirement for reproducible builds.

> With Josh, each deliverable gets its own virtual git repository with dependencies declared in the workspace.josh file. This means answering the above question becomes as simple as comparing commit ids. Furthermore due to the tree filtering each build is guaranteed to be perfectly sandboxed and only sees those parts of the monorepo that have actually been mapped.

> This also means the deliverables to be re-built can be determined without cloning any repos like typically necessary with normal build tools.

Bazel users might find this familiar. It does a similar thing by running build tools in a temporary directory that contains symlinks to your actual source code.

But it's not something you actually need if you're not using a monorepo, since your package manager won't pull in other packages that you don't depend on.

[1] https://josh-project.github.io/josh/usecases.html

jotaen · on Feb 20, 2024

> Josh combines the advantages of monorepos with those of multirepos by leveraging a blazingly-fast, incremental, and reversible implementation of git history filtering.

It might be just me, but I have some serious trouble to understand what this actually is, and how I would use it specifically.

From the homepage, I understand that this is some sort of utility around working with git histories in mono repos. But is Josh a client-side CLI tool? Is it a convention for structuring mono repos? A web service? A complete VCS? Or some combination of all that?

p_l · on Feb 21, 2024

It works by creating a proxy git remote which transparently rewrites history in ways that keep both sides in sync.

digdugdirk · on Feb 21, 2024

Transparently in the "I don't have to worry about it" way?

Or transparently in the "when something goes wrong I'll have no way to tell what happened" way?

p_l · on Feb 21, 2024

Well, if something goes wrong you tend to have multiple repos that at worst will be recoverable with some amount of lost history. Benefit of Git being a DVCS.

For most of the time you don't have to worry about it other than when you initially set things up.

parhamn · on Feb 21, 2024

It's almost always one then the other, no?

crotchfire · on Feb 21, 2024

Do not try and bend the spoon, that's impossible. Instead, only try to realize the truth...there is no spoon. Then you will see it is not the spoon that bends, it is only yourself

-- what Pijul users say when they overhear git users arguing with each other about monorepos.

https://pijul.org/

Git insists that you impose an artificial ordering on commits which affect disjoint sets of files. And then people come up with heroic efforts like josh to try to bandage over that mistake. Or you could just not make the mistake in the first place...

bognition · on Feb 21, 2024

What alternative are you suggesting?

exDM69 · on Feb 21, 2024

Pijul is a version control system that is patch based (like Darcs) instead of snapshot based (like Git and Mercurial).

Together with some mathematical models on the properties of patches it can deal with some merge situations that need to be resolved manually in Git.

Pijul's website (and Darcs' before that) has some theory that may be helpful if you are interested in the details.

The distinction between patch based and snapshot based is mostly invisible to the end user, except when it isn't.

gugagore · on Feb 21, 2024

Couldn't git just recreate patches from snapshot? I mean, this is what happens when you show a commit or a diff.

This is probably what you mean by how it's invisible to the user. When does it actually make a difference?

exDM69 · on Feb 21, 2024

Yes, git "recreates" the patches when you view a diff.

But git can't decide that these two patches are independent of each other and thus their order does not matter so it'll give you a merge conflict where pijul doesn't.

You can losslessly convert between the two models, but you can't efficiently apply patch theory (commutation etc) to a bunch of snapshots.

gugagore · on Feb 21, 2024

This doesn't seem to be true. I can use git rebase to change the order of commits, and I don't get conflicts if the corresponding patches are independent.

If you can losslessly convert between the two models, then it seems like you could make git's conflict resolution smarter without changing the underlying data model.

exDM69 · on Feb 21, 2024

Yes, of course you can manually do that.

> If you can losslessly convert between the two models, then it seems like you could make git's conflict resolution smarter without changing the underlying data model.

No, this will turn into a performance nightmare.

The keyword is efficiently. A patch based data model is a requirement to efficiently do the kind of deduction on patch commutativity and other properties that Pijul does. Converting back and forth on the fly using snapshots (which are a list of filenames and blob hashes) would not work outside of toy examples.

And this is what eventually "killed" Darcs (an earlier patch based system), its data model had some exponential corner cases that could not be resolved.

gugagore · on Feb 21, 2024

I can understand in principle but not actually. Converting between snapshot and patch representations is not an exponential effort operation. You could convert a git history to pijul (yes, expensive, but not exponential), do the conflict resolution in pijul land, and then handwaved convert back to snapshots.

pmeunier · on Feb 21, 2024

The key insight of Pijul is to be the smallest generalisation of a file that is a CRDT with insertions and deletions of bytes as its two operations, where "smallest" and "file" are meant in a specific sense.

The main thing that makes it all work is the extreme performance of its storage backend, which allows to manipulate a graph datastructure directly on disk, and avoid as many IO operations as possible while doing that. This works well, as all operations in Pijul (with some caveats) work in a time logarithmic in the size of history. And yet, it is slower than Git for some operations.

Therefore, what you're suggesting is linear in the size of history (importing), i.e. exponentially slower than Pijul, for every single merge!

alchemist1e9 · on Feb 21, 2024

I was such a heavy Darcs user once, before git existed. Then for a while I was using darcs to git and back. Sounds like I should take a serious look at Pijul.

crotchfire · on Feb 21, 2024

It's getting pretty good.

The problem is the primary author demanding CLAs for changes to the "plumbing" part of pijul (but not the "porcelain"). There are a handful of outside contributors who agreed to it for small drive-by fixes, but a lot of others (like myself) who saw the CLA and said "yeah no thanks these always end with somebody getting screwed".

It's a real tradgedy. Hopefully the pijul plumbing is close enough to "perfect and finished forever" (like TeX, djbdns, etc) that it can survive for the rest of its life on one serious contributor and low-effort drive-by bugfixes. Otherwise the only plausible outcomes are a no-CLA fork (highly unlikely since the main author wouldn't participate) or the project dying.

I do hope the primary pijul author will rethink their "no limits" CLA. History has shown that these always end with contributors getting screwed, either by the primary author, or some other entity to whom the primary author sold the rights.

pmeunier · on Feb 21, 2024

This problem was never reported a single time on our Zulip or by email, but I'm glad you asked, because this choice wasn't obvious:

- The main reason the CLA is there is that earlier versions of Libpijul saw lots of online arguments about its very reasonable GPL2 license, which made me doubt of the choice of GPL2. These arguments were often followed by "@me was there"-style contributions to Libpijul, like applying linter fixes without understanding any of the code. I got scared of having to ask all past contributors for permission to release it under (say) a BSD license (which I have even recently discussed with the FreeBSD maintainers during FOSDEM 2024).

- Also, the fact that the Pijul binary is GPL2 (without a CLA) and dependent on Libpijul forces me, in the case of an evolution of the license, to choose something compatible with GPL2. Deal breaker, really?

- Another goal of the CLA was to experiment with something cool and meta, at the interactions between version control and licensing: I wanted to test the idea that the CLA was a single patch (or a sequence of patches, strictly ordered by dependency), and contributors would be required to add a dependency on the CLA patch. That would make them state in their patches which version of the CLA the agreed with when they recorded. And as Pijul does a lot of dogfooding…

- Other than that, the Pijul plumbing is indeed mostly meant to become just a static algorithm after a while, not many features are missing. The goal of Pijul is to have essentially three core functions, create a patch, apply a patch, unapply a patch.

crotchfire · on Feb 22, 2024

Thank you for replying!

> This problem was never reported a single time on our Zulip or by email

This shouldn't be surprising; the CLA requirement drives away contributors before their first contribution. Like me. I saw no point in a person who hasn't made any contributions petitioning you for a policy change. This phenomenon is one of the insidious effects of policies which ward off contributors before they make their first contribution.

A lot of us fell for CLAs when they were new and shiny. By now we've all seen way too many examples of people getting screwed by them. Fool me once, fool me twice... https://drewdevault.com/2023/07/04/Dont-sign-a-CLA-2.html

Using artificial pijul dependency arcs as CLA acknowledgements is certainly cute. But frankly, for legal matters, something that can't be misinterpreted like a `Developer-Certificate-of-Origin:` header is probably a better idea. By the way, have you noticed that `pijul git` import loses some of the git headers, like `Committer` and `CommitterDate`? Also, importing a large repo has exposed what appears to be some significantly superlogarithmic-time behaviors in `pijul record`.

> in the case of an evolution of the license

I think you should try to let go of the idea of "evolving the license" of an open source project that invites contributions.

If you want to go the sqlite route that is absolutely fine; honestly pijul is one of the few pieces of software that is (after accounting for younger age) on the same level of sophistication as sqlite. Just explain that your project, like sqlite, is open source but not soliciting contributions.

I think that with the CLA requirement you have effectively made this choice -- but in the way which maximizes the number of people annoyed: both the people who get triggered by any mention of the GNU project as well as the people who support its goals! This is quite a neat trick to pull off, although I fear it was unintentional.

> like applying linter fixes without understanding any of the code

Yeah I saw that PR, which was obnoxious for other reasons as well. Bravo to you and felix91gr for being so polite in your responses.

pmeunier · on Feb 22, 2024

First, thanks for the patient and kind advice. This is so rare I didn't even know you were allowed to talk like that online.

> I think you should try to let go of the idea of "evolving the license" of an open source project that invites contributions.

Yes, this is becoming clearer now, but wasn't for a long time, since choosing any license on a complicated tool like this one, which few people fully understand, inevitably sparks discussions on the license, because if you got to say something, it is easier to discuss than commutativity, pushouts (hard-core theory), Sanakirja, endianness (hard-core practice).

> This is quite a neat trick to pull off, although I fear it was unintentional.

;-)

> Yeah I saw that PR

There were many: a single one wouldn't have changed my mind. And many more on what projects/topics/software I should be interested in.

alkonaut · on Feb 21, 2024

This is a great idea. Where do apply this though? The reason I use git at all is because of the support in other systems. It's not the best VCS, it's the best one that I can create a new repo for in Github or Azure Devops, or that I can link with issues in some issue mgmt system.

If I'd need to self host a git repo, it would immediately lose most of its value. That's not to say it's useless longer term though. Other good extensions to git like Git LFS were also once in that situation, but wide adoption quickly meant that it's now a (mostly) integral part of git and every hosted git repository supports it. This would be very useful if it was adopted in the same way by the big repo hosters.

aseipp · on Feb 21, 2024

It doesn't manage a Git repository. It's a stateless proxy that sits in front of your existing Git repository. You just run it in front of your existing infrastructure. That's about it.

k8svet · on Feb 21, 2024

The site pretty clearly explains that the split repos are usable as normal git repos. And indeed I have a project I follow by cloning a josh managed repo. I didn't even know it was a josh repo until months after starting.

alkonaut · on Feb 21, 2024

Yes that I think is a given if there should be any traction. The big question is: how do I (or how long until I can) create a repo like this at the various git providers. Because to a rounding error, people don't create or host their own git repos. They are hosted on github, devops, etc. Not because it's better or because it's hard to self-host a git repo but because it's convenient and more importantly because any self hosted git usually means giving up important features in integration with CI systems, Issue management and so on. Offering one thing (e.g. better monorepo support) is not going to fly very well if it means giving up N other things. At that point it's mostly a tech demo like early Git-LFS was until the full support was on Github.

p_l · on Feb 21, 2024

You run JOSH locally, pointing your git client to it as a remote, and josh in turn is configured to point to multiple "normal" git repos anywhere you want.

k8svet · on Feb 21, 2024

Uh, what's going on with this thread? I'm pretty sure I didn't write this comment an hour ago (how it displays) and it didn't appear at the top of my comment history? Instead, appearing as "a day ago" in my history further down.

macintux · on Feb 21, 2024

Pure speculation: this post might have been selected under HN’s second chance pool[0]; perhaps comments are re-timestamped based on when the post was resurrected.

[0] https://news.ycombinator.com/item?id=26998308

djsavvy · on Feb 21, 2024

Weird. I don't see this comment, or the one you're referring to, on your profile either.

Are you saying that you didn't write the comment, or that the timestamp is off?

scubbo · on Feb 21, 2024

It's _amazing_ how much work monorepo fans put into "making a monorepo operate like a polyrepo", when they could just...

danpalmer · on Feb 21, 2024

I sort of get it. I don't think anyone coming from a multi-repo world really understands the full implications of a monorepo until they've worked in a large scale one (e.g. those at Google, Meta, or other places), so I think they try to keep certain kinds of practice that would ideally change in a monorepo world.

And I also don't think Git scales to that (Microsoft just about made it work with a custom filesystem, Google and Meta both use custom source control, Meta's is based on Mercurial), and therefore it's often necessary to split the monorepo into many repos and there exist a lot of tools for treating a set of Git repos as one Monorepo.

scubbo · on Feb 22, 2024

> I don't think anyone coming from a multi-repo world really understands the full implications of a monorepo until they've worked in a large scale one

That's entirely fair. My sole experience is the one black-sheep monorepo at my own relatively-recently joined company, which is nowhere even close to approaching true large scale.

Genuine question, though - what _are_ the advantages, as you see them (you didn't explicitly say as much, but I'm reading between the lines that you _can_ see some)? Every positive claim I've seen (primarily at https://monorepo.tools/, but also elsewhere) feels either flimsy, or outright false:

* "No overhead to create new projects - Use the existing CI setup" - I'm pretty confident that the amount of DX tooling work to make it super-smooth to create a new project is _dwarfed_ by the amount of work to make monorepos...work...

* "Atomic commits across projects // One version of everything" - this is...actively bad? If I make a change to my library, I also have to change every consumer of it (or, worse, synchronize with them to make their changes at the same time before I can merge)? Whereas, in a polyrepo situation, I can publish the new version of my library, and decoupled consumers can update their consumption when they want to

* "Developer mobility - Get a consistent way of building and testing applications" - it's perfectly easy to have a consistent experience across polyrepos, and or to have an inconsistent one in a monorepo. In fairness I will concede that a monorepo makes a consistent experience more _likely_, but that's a weak advantage at best. Monorepos _do_ make it significantly harder to _deliberately_ use different languages in different services, though, which is a perfectly cromulent thing to permit.

ixxie · on Feb 20, 2024

How does this improve monorepo workflow?

rapnie · on Feb 20, 2024

See the use cases: https://josh-project.github.io/josh/usecases.html

rapnie · on Feb 20, 2024

Repo link: https://github.com/josh-project/josh

wbl · on Feb 21, 2024

I was unaware that people who used perforce and git would go decide that git needed to work like perforce.