I'm not entirely sure that you want a submodule here, but submodules will let you do what you are describing. Submodules are tricky, though. There's a reason people call them sob-modules.
Long
First, it will help a great deal if you get your definitions—actors and actions—straight:
A repository does not push anything. It's just a collection of commits (plus some names; see the last point below).
Git (the software suite) creates and manipulates repositories, including the commits inside them.
The git push command pushes commits.
A commit is a thingy (technically, a commit object, but people use the term pretty loosely, hence the loose "thingy" term here) with the following features:
- It has a unique hash ID.
- It stores files. Note that commits do not store folders, just files. These files have path names that include embedded slashes (always forward slashes, even if you extract the commit's files on a Windows system with its backward slashes). This becomes important later, but if you like, you can think of them as folders-full-of-files, as long as you remember that Git can't store an empty folder properly (because it only stores files). The files are stored as full snapshots, although they get compressed and, importantly, de-duplicated across all commits in a repository. So, the fact that typically some new commit re-uses 30,000 files from some previous commit doesn't matter: the re-used files take no space, because they're literally re-used.
- It stores metadata, or information about the commit itself. This includes stuff like who made the commit and when, and a log message, and so on; and, crucial for Git's own operation, it also includes the raw hash ID of some set of earlier commits. Most commits store just one earlier-commit hash ID, which we (and Git) call the parent. This is how history works, in a Git repository: each commit remembers its parent.
- It is completely read-only. No part of any commit can ever be changed. (This is what allows the de-duplication, and a lot of other Git magic.)
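To make those features concrete, here is a minimal sketch you can run in a scratch directory. The file names, the identity settings, and the empty-dir folder are all made up for illustration; the point is only to show that a commit lists files by slash-separated path, drops the empty folder, and records its parent's hash ID in its metadata.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)                 # throwaway directory for this sketch
cd "$demo"
git init -q

mkdir -p src/lib empty-dir        # empty-dir has no files in it
echo 'hello' > src/lib/greeting.txt
git add .
git commit -q -m "first commit"

echo 'more' >> src/lib/greeting.txt
git commit -q -am "second commit"

git ls-files                      # only files are listed; empty-dir is absent
git cat-file -p HEAD              # raw metadata: tree, parent, author, committer, message

# Pull the parent hash ID out of the second commit's metadata.
parent_in_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
```

The parent line printed by git cat-file is the same hash ID that HEAD~1 resolves to, which is the "each commit remembers its parent" part.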
A repository also contains names (such as branch and tag names) that allow Git to find commits. This works by having one name store exactly one hash ID. For branch names, that stored hash ID is, by definition, the last commit in the branch. Since commits store parent hash IDs, Git can work backwards from whichever commit we decide to call "last in branch X": X~1 is the second-to-last in X, X~2 is the third-to-last, and so on.
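If you want to see the name-to-hash-ID mapping for yourself, something like the following sketch should do. The branch is literally named X here to match the text, and the three empty commits and the identity settings are just stand-ins.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/X      # make the initial branch be named X

for n in 1 2 3; do
    git commit -q --allow-empty -m "commit $n"
done

last=$(git rev-parse X)                 # the one hash ID stored in the name X
second_to_last=$(git rev-parse X~1)     # found by following the parent link
third_to_last=$(git rev-parse X~2)

git log --format='%h %s' X              # walks backwards from the last commit
```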
The act of adding a new commit to a branch consists of the following steps:
1. You check out that commit (with git checkout or git switch) by checking out that branch (with the same command), so that this is now the current branch. This action fills in both Git's index, which holds your proposed next commit, and your working tree, where Git copies out all the files into a usable form. The internal, de-duplicated form is generally unusable to everything except Git itself.
2. You do some stuff in your working tree. Git has zero control or influence over this part, a lot of the time, since you'll be using your own editor or compiler or whatever. You can use Git commands here and then Git will be able to see what you did, but mostly, Git doesn't have to care, because we move on to step 3:
3. You run git add. This instructs Git to take a look at the updated working tree files. Git will copy these updated files back into Git's index (aka the staging area), in their updated form, re-compressing and de-duplicating them and generally making them ready for the next commit.
4. You run git commit. This packages up new metadata (your name, the current date and time, a log message, and so on) and adds the current commit's hash ID to make up the metadata for the new commit. The new commit's parent will thus be the current commit. Git then snapshots everything in the index at this time (which is why git checkout filled it in, in step 1, and then git add updated it in step 3), along with the metadata, to make the new commit. This gives the new commit its new hash ID, which is actually just a cryptographic checksum of the entire data set here.
It's at this point that the magic happens: git commit writes the new commit's hash ID into the current branch name. So now, the last commit on the branch is your new commit. This is how a branch grows, one commit at a time. No existing commit changes (none can change), but the new commit points back to what was the last commit, and is now the second-to-last commit. The branch name moves.
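Here is a rough sketch of that whole cycle, run in a throwaway repository. The file name and identity are invented; the only thing to watch is that the branch name holds a different hash ID afterwards, and that the new commit's parent is the old tip.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/branch-name
echo one > file.txt
git add file.txt
git commit -q -m "initial"

git checkout -q branch-name              # check out the branch
old_tip=$(git rev-parse branch-name)     # the current "last commit"

echo two >> file.txt                     # do some stuff in the working tree
git add file.txt                         # copy the updated file into the index
git commit -q -m "second"                # snapshot the index, move the name

new_tip=$(git rev-parse branch-name)
new_parent=$(git rev-parse branch-name~1)
echo "branch-name moved from $old_tip to $new_tip"
```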
You really need to have all of these down pretty cold to make submodules work, because submodules actually use all of this stuff, but then violate some rules. Now it starts to get tricky. We also need to look more closely at git push, just for a moment.
git push: cross-connecting one Git repository with another
Making a new Git commit, in some Git repository, just makes a new snapshot-plus-metadata. The next trick is to get that commit into some other Git repository.
If we start with two otherwise-identical Git repositories, each has some set of commits and some branch names identifying the same last commit:
... <-F <-G <-H <--branch-name [in Repo A]
and the same in Repo B. But then, over in Repo A, we do:
git checkout branch-name
<do stuff>
git commit
which causes repo A to contain:
...--F--G--H--I <-- branch-name
(I get lazy and don't bother drawing the commit-to-commit arrows correctly here.) New commit I (where I, like H and G and F, stands in for some big ugly random-looking hash ID) points back to existing commit H. You might even make more than one new commit:
...--F--G--H--I--J <-- branch-name
Now you run git push origin branch-name, to send your new commits, in your repository, back to the "origin" repo (which we were calling "repo B" before, but let's call it origin now).
Your Git software suite ("your Git") calls up theirs. Your Git lists out the hash ID of your latest commit, i.e., commit J. Their Git checks in their repository, to see if they have J, by hash ID. They don't (because you just made it). So their Git tells your Git: OK, gimme! Your Git is now obligated to offer J's parent I. They check and don't have I either, so they ask for that one too. Your Git is now obligated to offer commit H. They check and (hey!) this time they do have commit H already, so they say: no thanks, I have that one already.
Your Git now knows not only that you must send commits J and I, but also which files they already have. They have commit H, so they must have commit G too, and commit F, and so on. They have all the de-duplicated files that go with those commits. So your Git software suite can now compute a minimal set of stuff to send them so that they can reconstruct commits I-J.
Your Git does so; that's the "counting" and "compressing" and so on that you see. Their Git receives this stuff, unpacks it, and adds the new commits to their repository. They now have:
...--F--G--H   <-- branch-name
            \
             I--J
in their Git repository. Now we hit a really tricky bit: How does a Git, in general, find a commit? The answer is always, ultimately, by its hash ID—but that just brings another question, which is: how does a Git find a hash ID? They look random.
We already said this earlier though: a Git (the software suite) often finds some specific commit in some specific repository through the use of a branch name. The branch name branch-name, in your repository, finds the last commit, which is now J. We'd like the same name in their repository to find the same last commit.
So, your Git software now asks their Git to set their repository's branch name branch-name to identify commit J. They will do this if you are allowed to do this. The "allowed" part can get arbitrarily complicated (sites like GitHub and Bitbucket add all kinds of permissions and rules here), but if we assume that it's OK, and that they'll do that, then they will end up with:
...--F--G--H--I--J <-- branch-name
in their repository, and your Git repository and their Git repository will be in sync again, at least for this particular branch name.
So that's how git push normally works: you make new commits, adding them on to the end of your branch, and then you send your new commits to some other Git, and ask their software to add the same commits to the end of a branch of the same name in their repository. (Whew!)
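The whole exchange can be reproduced locally, as a sketch, with a bare repository standing in for origin. The paths under the temporary directory and the empty commits named H, I and J are assumptions for illustration only; a real origin would live on some hosting site with its own permission rules.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q --bare "$work/origin.git"            # stands in for "their" repository
git clone -q "$work/origin.git" "$work/mine"     # "your" repository
cd "$work/mine"
git symbolic-ref HEAD refs/heads/branch-name

git commit -q --allow-empty -m "H"
git push -q origin branch-name                   # both sides now end at H

git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

# Commits you have that origin lacks, as far as your Git knows.
to_send=$(git rev-list --count origin/branch-name..branch-name)
theirs_before=$(git --git-dir="$work/origin.git" rev-parse branch-name)

git push -q origin branch-name                   # send I and J, ask them to move the name

theirs_after=$(git --git-dir="$work/origin.git" rev-parse branch-name)
mine=$(git rev-parse branch-name)
```

Before the second push their branch-name still identifies H; afterwards it identifies the same commit J that yours does.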
Submodules
A submodule, in Git, is little more than two separate, mostly-independent Git repositories. This of course needs a lot of explanation:
- Why are they only "mostly" independent? (What does that even mean?)
- If they're little more, what more are they?
First, like any repository, a submodule repository is a collection of commits, each with a unique hash ID. We—or Git at least—like to refer to one of the two repositories as the superproject and the other as the submodule. Both of these start with the letter S, which is annoying, and both words are long and klunky, so here I'll use R (in bold like this) as the superproject Repository, and S as the Submodule.
(Side note: the hash IDs in R and S are independent from each other. Git tries pretty hard—and usually succeeds—at making hash IDs globally unique across every Git repository everywhere in the universe. So there's no need to worry about "contaminating" R with S IDs or vice versa. In any case we can just treat every commit hash ID as if it's totally unique. Normally, with a normal non-R non-S repository, we don't even have to care about IDs, as we just use names. But submodules make you have to be more aware of the IDs.)
What makes R a superproject in the first place is that it lists raw hash IDs from S. It also has to list instructions: if we've done a git clone of R, we don't even have a clone of S yet. So R needs to contain the instructions so that your Git software can make a clone of S.
The instructions you give to git clone are pretty simple:
git clone <url> <path>
(where the path part is even optional, but here, R will always specify a path, using those forward slash path names we mentioned earlier). This set of instructions goes into a file named .gitmodules. The git submodule add command will set up this file in R for you. It's important to use it, to set up the .gitmodules file. Git will still make a submodule even if you don't set this up, but without the cloning instructions, the submodule won't actually work.
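For a feel of what ends up in that file, the sketch below writes the two settings by hand with git config and prints the result. This is only to show the file's shape: in real use you would let git submodule add write it. The URL is a made-up placeholder, and build is the path used later in this answer.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"

# Write the same two keys that git submodule add would record.
# https://example.com/build-output.git is a hypothetical URL.
git config -f .gitmodules submodule.build.path build
git config -f .gitmodules submodule.build.url https://example.com/build-output.git

cat .gitmodules      # an INI-style section named: submodule "build"

# Read the cloning instructions back out again.
recorded_path=$(git config -f .gitmodules --get submodule.build.path)
recorded_url=$(git config -f .gitmodules --get submodule.build.url)
```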
Note that there's no proper place to put authentication (user names and passwords) in here. That's a generic submodule issue. (You can put them in as plaintext in the .gitmodules file, but don't: it's a very bad idea, since they're not encrypted or protected.) As long as you have open access to cloning the submodule, it doesn't normally present any real problem. If you don't, you'll have to solve this problem somehow.
In any case, you will need, just once, to run:
git submodule add ...
(filling in the ... part) in what will thus become superproject R, so as to create the .gitmodules file. You then need to commit the resulting .gitmodules file, so that people who clone R and check out a commit that contains that file get that file, and their Git software can run the git clone command to create S on their system.
You'll also need to put S somewhere they can clone it. This, of course, means that first you need to create a Git repository to hold S. You do this the way you make any Git repository:
git init
or:
git clone
(locally, on your machine) along with whatever you do on whatever hosting site that creates the repository there.
Now that you have a local repository S, you need to put some commit(s) into it. What goes into these commits?
Well, you already said that you'd like your R to have a build/ directory (folder) in it, but not actually store any of the built files in any of the commits made in R. This is where submodules actually work. A submodule, in R, for S, works by saying: create me a folder here, then clone the submodule into the folder. Or, if the submodule repository already exists (as it will when you're setting all this up in the first place, with you just now having created S), you simply put that entire repository into your working tree for R, under the name build.
Note that build/.git will exist in R's working tree at this point. That's because a Git repository hides all the Git files in the .git directory (folder) at the top level of the working tree. So your new, empty S repository consists of just a .git/ containing Git files.
You can now run that git submodule add command in R, because now you have the submodule in place:
git submodule add <url> build
(You might want to wait just a little bit, but you can definitely do it at this point—and this is the earliest point at which you can do it, since up until now, S didn't exist or was not in the right place yet.)
You can now fill the build/ directory that lives in R's working tree with files, e.g., by running npm run build, or whatever it is that populates the build/ directory. Then you can:
(cd build; git add .)
or equivalent, so as to add the build output in S. You can now create the first commit in S, or maybe the second commit in S if you like to create a README.md and LICENSE and such as your initial commit. You can now have branches in S as well, since you now have at least one commit in S.
Now that you're back in R though, it's time to git add build (or, if you chose to delay it, run that first git submodule add). In the future you'll use git add build. This directs the Git that is manipulating the index / staging-area for R to enter the repository S and run:
git rev-parse HEAD
to find the raw hash ID of the current commit in S.
The superproject's Git repository's index now acquires a new gitlink entry. A gitlink entry is like a regular file, except that instead of git checkout checking it out as a file, it provides a raw hash ID. That's basically all it is: a pathname (in this case, build/) and a raw hash ID.
This gitlink is like one of those read-only, compressed, and de-duplicated files that goes in a commit. It's just that instead of storing file data, it stores a commit hash ID. That hash ID is that of some commit in S, not some commit in R itself. But now that you've updated the index (or staging area) for R, you will need to make a new commit in R. The new commit will contain any updated files, plus the right hash ID for S, as found just now by the git add you ran (or that git submodule add ran for you).
The next commit you make in R (not in S) will list the hash ID of the current commit in S. So once you've committed the built files in S, you can git add them in R and git commit in R.
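Put together, the setup might look roughly like the sketch below. It is a local simulation under assumptions: app.js and bundle.js stand in for your sources and build output, the URL is a placeholder that is never contacted (S already exists at build, so nothing gets cloned), and S gets its first commit before git submodule add runs so that there is a commit hash ID to record.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# R: the superproject, holding the sources.
git init -q "$work/R"
cd "$work/R"
echo 'console.log("source");' > app.js
git add app.js
git commit -q -m "sources"

# S: a separate repository living at build/ inside R's working tree.
git init -q build
cd build
echo 'built output' > bundle.js          # stand-in for what a build would produce
git add .
git commit -q -m "build output"
s_commit=$(git rev-parse HEAD)           # what R is about to record
cd ..

# Record S in R: writes .gitmodules and stages the gitlink for build.
# https://example.com/build-output.git is a hypothetical URL.
git submodule add https://example.com/build-output.git build
git commit -q -m "add build submodule"

git ls-files -s build                    # mode 160000 marks a gitlink entry
gitlink_mode=$(git ls-files -s build | awk '{print $1}')
gitlink_hash=$(git ls-files -s build | awk '{print $2}')
```

The hash ID stored in R's index for build is the hash ID of the commit in S, and none of the built files themselves appear in R's commit.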
The last and trickiest part
Now comes the last part, which—if you thought all of the above was complicated and tricky—is the trickiest:
You have to git push the submodule commit in S so that it's generally available. In general, you should do this first, though you don't actually have to.
Then you have to git push the superproject commit in R so that others can get it. When others get this commit from the other clone of R, they'll be able to see the right hash ID from S.
Then, if someone else—let's say your co-worker Bob—wants to get both the built files and the sources, they have to:
- Obtain your new R commit.
- Instruct their Git to check out the new R commit.
- Instruct their Git to use the new checked out R commit to run git fetch in S so as to obtain the new S commit.
- Instruct their Git to actually enter their clone of S and git checkout the correct commit.
They can do this all at once with git checkout --recurse-submodules, or by setting the submodule.recurse configuration option so that recursing is the default. Note what can go wrong though:
They might obtain your new R commit and check it out, but forget to update their S at all.
Or, they might obtain your new R commit and check it out and then try to check out the new commit in S without first running git fetch in their clone of S, so that they don't have the new commit.
Or, they might remember everything they should do, but someone forgot to push the new S commit to the shared repository people can get it from. They'll get an error about their submodule Git being unable to find the requested commit.
You can see how this can get pretty messy. It's very easy for the various separate commits to get de-synchronized in various ways. Once you have the procedures down, and have scripts around everything that make sure that all the steps happen at the right times, it can work pretty well. But there are many ways for things to go wrong.
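As one possible shape for Bob's side, here is a local simulation using git submodule update --init, which clones or fetches S as needed and then checks out whichever commit R records. Everything here is an assumption for the demo: the directory names, bundle.js, and the use of plain local paths as URLs. Because the URLs are local paths, the sketch passes -c protocol.file.allow=always, which recent Git versions require for file-based submodule clones; with a hosted S you would not normally need that.

```shell
set -e
# Hypothetical identity so that git commit works in a clean environment.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# The shared S, with one commit of build output.
git init -q "$work/S"
cd "$work/S"
echo 'built v1' > bundle.js
git add bundle.js
git commit -q -m "build output v1"

# The shared R, which records S at path build.
git init -q "$work/R"
cd "$work/R"
git -c protocol.file.allow=always submodule add "$work/S" build
git commit -q -m "add build submodule"

# Bob clones R only. He gets an empty build/ directory and no S yet.
git clone -q "$work/R" "$work/bob"
cd "$work/bob"
files_before=$(ls -A build)

# Clone S using the .gitmodules instructions and check out the recorded commit.
git -c protocol.file.allow=always submodule update --init

wanted=$(git ls-files -s build | awk '{print $2}')   # hash ID R asks for
actual=$(git -C build rev-parse HEAD)                # hash ID Bob's S has checked out
```

If the S commit had never been pushed to the shared S, the update step is where Bob would see the error about the missing commit.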