How does Git store duplicate files?
We have a Git repository that contains SVM AI input data and results. Every time we run a new model, we create a new root folder for that model so we can organize our results over time:
/run1.0
    /data
        ... 100 MB of data
    /classification.csv
    /results.csv
    ...
/run2.0
    /data
        ... 200 MB of data (including run1.0/data)
    /classification.csv
    /results.csv
    ...
As we build new models, we may pull in data (large .wav files) from a previous run. That means our data folder for 2.0 may contain all the files in 1.0/data plus any additional data we have collected.
The repo is going to exceed a gigabyte if this keeps up.
Does Git have a way to recognize duplicate binary files and store them only once (e.g. something like a symlink)? If not, we will rework how the data is stored.
I am probably not going to explain this quite right, but my understanding is that every commit stores a tree structure representing the file structure of your project, with pointers to the actual files, which are stored in the objects subfolder. Git uses the SHA-1 hash of a file's contents to create the file name and subfolder. For example, if a file's contents produced the following hash:
0b064b56112cc80495ba59e2ef63ffc9e9ef0c77

it is stored as:

.git/objects/0b/064b56112cc80495ba59e2ef63ffc9e9ef0c77
The first two characters are used as the directory name and the rest as the file name.
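You can watch this happen in a scratch repository. A minimal sketch (the file name and contents here are just for illustration; the hash you get depends entirely on your file's contents):

$ git init demo && cd demo
$ echo "hello" > a.wav
$ git hash-object -w a.wav        # hash the file and write the blob into the object store
ce013625030ba8dba906f756967f9e9ca394464a
$ find .git/objects -type f
.git/objects/ce/013625030ba8dba906f756967f9e9ca394464a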
The result is that if you have multiple files with the same contents, whether under different names, in different locations, or in different commits, only one copy is ever saved, with several pointers to it in each commit tree.
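So committing two identical files costs only one blob. Continuing the scratch repo above (file names are hypothetical), both tree entries point at the same object:

$ cp a.wav b.wav
$ git add a.wav b.wav
$ git commit -m "two identical files"
$ git ls-tree HEAD
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    a.wav
100644 blob ce013625030ba8dba906f756967f9e9ca394464a    b.wav
$ git count-objects               # 3 loose objects: one blob, one tree, one commit

Note that this deduplicates whole files only: a file that differs by even one byte gets its own blob, so large binaries that change slightly between runs will still grow the repository.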