Although Git is an excellent tool for managing distributed repositories, there are a few tasks that are common enough that you will probably do them more than once, yet rare enough that the Git command line has no direct support for them. In this post I present a few helper utilities and step-by-step guides for some of the tasks I find myself doing often enough that they need to be automated.
Removing a Submodule
Although many people seem to consider submodules evil, I find them a very useful way to keep track of dependencies when those dependencies are not going to change often, or when your main project doesn't need to track the latest and greatest version of the dependency.
Adding submodules to a project is easy, and keeping them up to date is also relatively simple. Removing them (when you want to change the location of the repository you are tracking, or no longer need the dependency) is a multi-step process that I always have to Google.
Luckily someone has written a simple utility to manage the steps required. You can get it here. The page includes full documentation but the usage is fairly simple:
gitsubmodule remove submodule-name
And that's it. The script will perform all the necessary git commands and update all the appropriate files to remove references to the named submodule from your project. All you need to do is commit and push the changes.
When you migrate a repository to another server and want to update the submodule reference to point to the new server URL, you can simply delete the submodule with this tool, commit, and then add the submodule back with the new URL. I often do this with projects that move from my internal staging server to something public like GitHub.
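For reference, the manual steps that a tool like this automates look roughly like the following on a reasonably recent Git. This is only a sketch of the usual procedure, not the linked script's actual implementation. The function name remove_submodule is made up for illustration, and it assumes the submodule's name is the same as its path (the default when you add one).

```shell
#!/bin/bash
# Hypothetical helper: remove the submodule at path $1 from the current
# repository. Run it from the repository root.
# Assumption: the submodule's name is the same as its path.
remove_submodule() {
    local path="$1"
    # Drop the submodule's entry from .git/config and empty its working tree
    git submodule deinit -f -- "$path" &&
    # Remove the gitlink from the index and the section from .gitmodules
    git rm -f -- "$path" &&
    # Delete the cached copy of the submodule's repository
    rm -rf ".git/modules/$path"
}
```

After running remove_submodule submodule-name you still need to commit and push, exactly as with the utility above.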
Removing a File or Directory
Sometimes simply removing a file from your repository with git rm isn't enough - you want to delete the complete history of the file and any reference to it in your repository.
Although this seems to defeat the purpose of having a revision control system, there are a few cases where it becomes necessary. I'll go through some use cases later in the post; for now let's look at how we actually achieve it.
Once again, this operation is a multi-step process in Git, and once again, on the rare occasion you need to do it, you will have to Google for the steps. Luckily, someone has put together a very useful script to do the heavy lifting for us. I've duplicated it here:
```
#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository. To use
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

# remove all paths passed as arguments from the history of the repo
files=$@
git filter-branch --prune-empty --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

# remove the temporary history git-filter-branch otherwise leaves behind for a long time
rm -rf .git/refs/original/ && git reflog expire --all && git gc --aggressive --prune
```
One slight change I made to the script was to add the --prune-empty option to the git filter-branch command. This drops any commits that are left empty once the files have been removed. The original script removed the files but left a lot of empty commits behind that still appeared in the git log output.
Simply save the script to a file called gitpurge.sh and make sure it is on your path. You can then use it like this:
gitpurge.sh file1 file2 path1
This will remove the entire history for all the files and directories named in the parameters. If you are pushing these changes to a remote repository you will have to use the --force option to the push command like this:
git push origin --force
WARNING: If you do this, anyone who has a clone of the repository will need to delete it and clone the repository again. This operation is not recommended for widely used repositories.
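If your version of Git supports it, --force-with-lease is a somewhat safer variant of the forced push: it refuses to overwrite the remote branch if someone else has pushed to it since your last fetch. A minimal sketch, assuming the remote is called origin. The wrapper name safe_force_push is made up for illustration.

```shell
#!/bin/bash
# Hypothetical wrapper: force-push branch $1 (default: master) to origin,
# but only if origin's copy has not moved since our last fetch.
safe_force_push() {
    git push --force-with-lease origin "${1:-master}"
}
```

This does not change the warning above. Existing clones still hold the old history, so it only protects you from silently discarding commits that somebody pushed in the meantime.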
So why would you want to do this in the first place? Two of the most common reasons are discussed below.
Removing Sensitive or Large Files
If you have accidentally checked in a large binary file (or, through a poor choice, decided to keep large files that change regularly in your repository) you will want to remove it from the history as well as from the current branch.
One project I worked on had a massive history of large (> 20 MB) flash images stored in the repository - these were generated every week, and each new version was committed and pushed to the remote master repository. Git operations on this repository were painfully slow. Finding a different way of storing these files, and carefully tagging the repository so older revisions could be quickly rebuilt, meant that we no longer needed to commit the images. This didn't solve the problem of a huge repository though, so I deleted all the history related to those images and shrank the repository back to a reasonable size.
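To find out which files are bloating a repository in the first place, you can ask Git for the size of every blob in the history. The following is a sketch built from standard Git plumbing commands. The function name largest_blobs is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: list the N largest blobs (default 10) reachable from
# any ref, as "<size in bytes> <path>", largest first.
largest_blobs() {
    # Every object reachable from any ref, with its path where it has one
    git rev-list --objects --all |
        # Look up each object's type and size; %(rest) carries the path through
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        # Keep only blobs (file contents) and drop the type column
        sed -n 's/^blob //p' |
        sort -n -r |
        head -n "${1:-10}"
}
```

Because it walks the whole history, this also lists files that have already been deleted from the current branch, and those are exactly the ones worth passing to gitpurge.sh.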
Another reason for wanting the history purged is if you accidentally check in a file containing sensitive information, such as an SSH private key. Simply deleting the file using git rm isn't enough - you need to completely purge the history so no copy of the file remains in the repository.
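After a purge it is worth checking that nothing in the history still touches the file. A small sketch of such a check follows. The function name path_in_history is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed if any commit reachable from any ref
# still touches path $1, fail otherwise.
path_in_history() {
    [ -n "$(git log --all --oneline -- "$1")" ]
}
```

For example, path_in_history id_rsa && echo "still in history". Note that this only inspects commits reachable from your local refs. It says nothing about other clones or about a remote you have not force-pushed to yet.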
Breaking a Project into Submodules
Sometimes you realise that part of a project could be reused in other projects and should be split off into its own repository as a separate library or module that can be easily shared. A recent example for me is the UI framework I designed for the Bench Tester project. This part of the project will be very useful in a number of other projects, so I split it out as its own project.
So how do you do this while keeping the full history of changes you've made? Let's look at a sample project with three child components. The directory structure is something like this:
    /project
        /.git
        /component1
        /component2
        /component3
What we want to do is turn component2 into its own repository and then replace the component2 directory in the project with a submodule reference.
The first step is to make another clone of your main project and remove the remote reference from it. You can do that like this:
    /project$> cd ..
    /$> git clone --no-hardlinks project project2
    /$> cd project2
    /project2$> git remote rm origin
Now we want to turn project2 into a repository that contains everything under component2 in the root directory. Then we want to push that up to our remote repository.
    /project2$> # Remove everything except component2
    /project2$> gitpurge.sh component1 component3
    /project2$> # Move the contents of component2 to the root directory
    /project2$> git filter-branch --prune-empty --subdirectory-filter component2 HEAD
    /project2$> git reset --hard
    /project2$> rm -rf .git/refs/original/
    /project2$> git reflog expire --expire=now --all
    /project2$> git gc --aggressive --prune=now
    /project2$> # Set the new remote and push the changes
    /project2$> git remote add origin git@mygitserver:/component2
    /project2$> git push origin master
Now we need to go back to our original repository and add the new submodule.
    /project2$> cd ../project
    /project$> gitpurge.sh component2
    /project$> git submodule add git@mygitserver:/component2 component2
    /project$> git commit
    /project$> git submodule init
    /project$> git submodule update
    /project$> git commit
    /project$> git push origin --force
The directory structure for the main project will now look like this:
    /project
        /.git
        /component1
        /component2 -> git@mygitserver:/component2
        /component3
So there you have it. A nice set of helpers to do tasks with Git that, although not common, do come up often enough that they should be automated. I'm starting to build up a set of helper scripts in my local ~/bin directory to do these sorts of tasks for me - it comes in very handy when I do need to do them and just want it done quickly.