PyFiSync
Python (+ rsync or rclone) based intelligent file sync with automatic backups and file move/delete tracking.
Features
- Robust tracking of file moves
- Especially powerful on macOS, but works well enough on Linux.
- rsync mode:
  - Works out of the box with Python (tested on 2.7 and 3.5+) for rsync
  - Works over SSH for secure and easy connections with rsync mode
  - Uses rsync for actual file transfers to save bandwidth and make use of existing file data
- rclone mode: (beta!)
  - Can connect to a wide variety of cloud-services and offers encryption
  - Note that rclone is still supported and works but it is better to use syncrclone instead.
  - rclone support may be deprecated in the future!
- Extensively tested for a huge variety of edge-cases
Details
PyFiSync uses a small database of files from the last sync to track moves and deletions (based on configurable attributes such as inode numbers, sha1 hashes, and/or create time). It then compares mtime from both sides on all files to decide on transfers.
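As a rough illustration of the mtime comparison, the sketch below picks a transfer direction for each path present on both sides. The function name, the `A`/`B` labels, and the one-second tolerance are assumptions made for this example and are not taken from PyFiSync's code. Paths that exist on only one side (new or deleted files) are ignored here.

```python
def decide_transfers(mtimes_a, mtimes_b, tol=1.0):
    """Return {path: 'A->B' | 'B->A'} for files whose mtimes differ.

    mtimes_a / mtimes_b map relative paths to modification times (seconds).
    `tol` is a hypothetical tolerance for timestamp granularity.
    """
    actions = {}
    for path in sorted(set(mtimes_a) & set(mtimes_b)):
        diff = mtimes_a[path] - mtimes_b[path]
        if abs(diff) <= tol:
            continue  # close enough to be treated as unchanged
        # The newer side is the source of the transfer
        actions[path] = "A->B" if diff > 0 else "B->A"
    return actions


side_a = {"notes.txt": 1700000100.0, "todo.txt": 1700000000.0}
side_b = {"notes.txt": 1700000000.0, "todo.txt": 1700000000.5}
print(decide_transfers(side_a, side_b))  # {'notes.txt': 'A->B'}
```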
Backups
By default, any time a file is to be overwritten or modified, it is backed up on the machine first. No distinction is made in the backup for overwrite vs delete.
Attributes
Moves and deletions are tracked via attributes described below.
Move attributes are used to track whether a file has moved, while the prev_attributes are used to determine whether a file is the same as before.
Note: On HFS+ (and maybe APFS?), macOS's file system, inodes are not reused quickly. On ext3 (Linux), they are recycled rapidly, leading to issues when files are deleted and new ones are made. Do not use inodes alone on such systems.
Common attributes
- path -- This essentially means that moves are not tracked. If a file has the same name, it is considered the same file
- size -- File size. Do not use alone. Also, this attribute means that the file may not change between moves. See examples below
- mtime -- When the file was modified. Use with ino to track files
rsync and local attributes
Attributes for the local machine and an rsync remote
- ino (inode number) -- Track the filesystem inode number. May be safely used alone on HFS+ but not on ext3 since it reuses inodes. In that case, use with another attribute
- hashes -- Very robust to track file moves but, like size, requires the file not change. Also, slow to calculate (though, by default, they are not recalculated on every sync). Options:
  - adler -- Fast but less secure
  - dbhash -- Used for dropbox. Useful if comparing on hash
  - any hashlib.algorithms_guaranteed: sha384, sha3_224, sha3_512, md5, sha512, sha3_256, blake2b, sha3_384, shake_128, blake2s, sha256, shake_256, sha1, sha224
- birthtime -- Use the file create time. This does not exist on some linux machines, some python implementations (PyPy), and/or is unreliable
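For orientation, here is a minimal sketch of how these attributes can be read with Python's standard `os.stat`. The helper name `file_attributes` and the returned dictionary layout are assumptions for illustration; PyFiSync's own collection code may differ.

```python
import os
import tempfile


def file_attributes(path):
    """Collect size, mtime, ino, and (if available) birthtime for one file."""
    st = os.stat(path)
    return {
        "size": st.st_size,
        "mtime": st.st_mtime,
        "ino": st.st_ino,
        # st_birthtime is platform dependent; fall back to None when missing
        "birthtime": getattr(st, "st_birthtime", None),
    }


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
attrs = file_attributes(tmp.name)
os.remove(tmp.name)
print(attrs["size"], attrs["birthtime"] is None)
```

The `getattr` fallback mirrors the caveat above: where the create time is unavailable, `birthtime` should not be chosen as a move attribute.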
rclone attributes
- hash.HASH -- Use a hash from rclone. Depends on which hashes are available.
Suggested move Attribute Combinations
For rsync
- On macOS, the following is suggested: [ino,birthtime]
- On Linux, the following is suggested: [ino,mtime]
  - This means that moved files should not be modified on that side of the sync.
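To make the role of the move attributes concrete, the sketch below matches files that disappeared from their old path against files that appeared at a new path, using a configurable list of attributes as the key. The function name and the snapshot layout (path mapped to a dict of attributes) are hypothetical and only illustrate the idea, not PyFiSync's database format.

```python
def find_moves(prev, curr, move_attributes):
    """Pair old and new paths whose move attributes match.

    prev / curr map path -> attribute dict for the last sync and for now.
    """
    def key(attrs):
        return tuple(attrs[name] for name in move_attributes)

    # Index files from the previous sync that are gone from their old path
    missing = {}
    for path, attrs in prev.items():
        if path not in curr:
            missing.setdefault(key(attrs), []).append(path)

    moves = {}
    for path, attrs in curr.items():
        if path in prev:
            continue  # same path as before, so not a move candidate
        candidates = missing.get(key(attrs), [])
        if len(candidates) == 1:  # skip keys shared by several old files
            moves[candidates[0]] = path
    return moves


prev = {"a/report.txt": {"ino": 101, "birthtime": 1.0},
        "b/old.txt": {"ino": 102, "birthtime": 2.0}}
curr = {"docs/report.txt": {"ino": 101, "birthtime": 1.0},
        "new.txt": {"ino": 300, "birthtime": 9.0}}
print(find_moves(prev, curr, ["ino", "birthtime"]))
# {'a/report.txt': 'docs/report.txt'}
```

In this example `b/old.txt` has no match and would be treated as a deletion, while `new.txt` is simply a new file. With `[ino,mtime]` as the key, modifying a file while moving it changes the key, which is why moved files should not be modified on that side.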
Hashes
As noted, any algorithm in hashlib.algorithms_guaranteed is supported for rsync mode and the local machine. To save time, hashes from the previous sync are stored in a database and reused. This can be turned off in the config, forcing all of the files to be read and hashed again.
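The caching idea can be sketched as follows. This is an illustration under stated assumptions: the cache is a plain dict keyed on path, size, mtime, and algorithm, and the function names are invented for the example. PyFiSync's actual database layout and invalidation rules are not shown here.

```python
import hashlib
import os
import tempfile


def file_hash(path, algorithm="sha1", blocksize=1 << 20):
    """Hash a file in chunks with any algorithm hashlib guarantees."""
    if algorithm not in hashlib.algorithms_guaranteed:
        raise ValueError("unsupported algorithm: %s" % algorithm)
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as fobj:
        for block in iter(lambda: fobj.read(blocksize), b""):
            hasher.update(block)
    if algorithm.startswith("shake_"):
        return hasher.hexdigest(32)  # SHAKE digests need an explicit length
    return hasher.hexdigest()


def cached_hash(path, cache, algorithm="sha1"):
    """Reuse a stored hash while size and mtime are unchanged."""
    st = os.stat(path)
    key = (path, st.st_size, st.st_mtime, algorithm)  # hypothetical cache key
    if key not in cache:
        cache[key] = file_hash(path, algorithm)
    return cache[key]


cache = {}
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
digest = cached_hash(tmp.name, cache)  # reads and hashes the file
again = cached_hash(tmp.name, cache)   # served from the cache
os.remove(tmp.name)
print(digest == again, len(cache))  # True 1
```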
Empty Directories
PyFiSync syncs files and therefore will not sync empty directories from one machine to the other. However, if, and only if, a directory is made empty by the sync, it will be deleted. That includes nested directories. In rclone mode, empty directories are not handled at all by PyFiSync.
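One way to picture the "made empty by the sync" rule is the sketch below. It assumes the sync keeps a set of directories that lost a file (called `touched_dirs` here, a hypothetical name) and only ever inspects those, so directories that were already empty are left alone.

```python
import os
import shutil
import tempfile


def prune_emptied_dirs(root, touched_dirs):
    """Remove directories emptied by the sync, walking up toward `root`.

    `touched_dirs` holds directories (relative to root) that lost a file
    during the sync; other directories are never visited.
    """
    removed = []
    root = os.path.abspath(root)
    for rel in sorted(touched_dirs, key=len, reverse=True):
        current = os.path.join(root, rel)
        # Climb upward while each directory is empty (handles nested cases)
        while current != root and os.path.isdir(current) and not os.listdir(current):
            os.rmdir(current)
            removed.append(os.path.relpath(current, root))
            current = os.path.dirname(current)
    return removed


root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "a", "b"))         # emptied by the "sync"
os.makedirs(os.path.join(root, "already_empty"))  # untouched, so it stays
removed = prune_emptied_dirs(root, {os.path.join("a", "b")})
kept = os.path.isdir(os.path.join(root, "already_empty"))
shutil.rmtree(root)
print(removed, kept)
```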
Install
There are no dependencies (for rsync mode)! Everything is included in the package (though ldtable, now DictTable, is also developed separately).
To install:
$ python -m pip install git+https://github.com/Jwink3101/PyFiSync
Or download the zip file and run
$ python setup.py install
If using the rsync remote (see Setup below), install PyFiSync on the remote machine too.
Note: On the remote machine, the path to PyFiSync must be found via SSH. For example, if your Python is from (Ana/Mini)conda, then it places the paths into the .bash_profile. Move the paths to .bashrc so that PyFiSync can be found. Alternatively, specify remote_exe.
Setup
See rsync for setup of the default mode. PyFiSync must be installed on both machines (or the Python scripts must be there and configured).
Setting up rclone is a bit more involved since you must set up an appropriate rclone remote. See rclone readme for general details and rclone_b2 for a detailed walkthrough of setting up with B2 (and S3 with small noted changes).
To initiate an rclone-based repo, do
$ PyFiSync init --remote rclone
Settings
There are many settings, all documented in the config file written after an init. Here are a few:
Exclusions
Exclusion naming is done in such a way that it replicates a subset of rsync exclusions. That is, the following pattern is what this code follows. rsync has its own exclusion engine, which is more advanced but should behave similarly.
- If an item ends in /, it is a folder exclusion
- If an item starts with /, it is a full path relative to the root
- Wildcards and other patterns are accepted
Pattern | Meaning |
---|---|
* | matches everything |
? | matches any single character |
[seq] | matches any character in seq |
[!seq] | matches any character not in seq |
Examples:
- Exclude all git directories: .git/
- Exclude a specific folder: /path/to/folder/ (where / is the start of the sync directory)
- Exclude all files that start with file: file*
- Exclude all files that start with file in a specific directory: /path/to/file*
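The wildcard table matches the syntax of Python's standard `fnmatch` module, so the rules can be approximated with it. The sketch below is an assumption about how such matching could look, not PyFiSync's implementation: the function name `is_excluded` is hypothetical, and it tests a single path only (a real traversal would also stop descending into excluded folders).

```python
import fnmatch


def is_excluded(relpath, is_dir, patterns):
    """Apply the exclusion rules to a path relative to the sync root."""
    relpath = "/" + relpath.strip("/")    # e.g. "/path/to/file1"
    name = relpath.rsplit("/", 1)[-1]
    for pattern in patterns:
        if pattern.endswith("/") and not is_dir:
            continue                      # trailing "/" only matches folders
        pat = pattern.rstrip("/")
        if pat.startswith("/"):
            # Leading "/" anchors the pattern at the sync root
            matched = fnmatch.fnmatchcase(relpath, pat)
        else:
            # Otherwise match the item name at any depth
            matched = fnmatch.fnmatchcase(name, pat)
        if matched:
            return True
    return False


patterns = [".git/", "/path/to/folder/", "file*"]
print(is_excluded("project/.git", True, patterns))       # True
print(is_excluded("other/path/to/folder", True, patterns))  # False
print(is_excluded("docs/file1.txt", False, patterns))    # True
```

Note that in `fnmatch`, `*` also matches `/`, consistent with "matches everything" in the table.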
Exclude if Present
PyFiSync allows for exclusion of a directory due to the presence of a specified file name (the contents of the file do not matter, only its presence).
Unlike regular exclusions, which halt traversing deeper into an excluded directory tree, exclude_if_present is a filter applied after the fact. This approach is safer, as adding an exclusion file on one side will not cause a delete to be incorrectly propagated. It does come at a small performance penalty, as the excluded directory is initially traversed.
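The "filter after the fact" behavior can be sketched like this. The function name and the marker file name `.nosync` are hypothetical; the point is only that the full file list exists before any directory is dropped.

```python
import os


def filter_exclude_if_present(files, marker):
    """Drop files that live in or below a directory containing `marker`.

    `files` is the complete list of relative paths from a finished
    traversal, which is why the excluded tree is still walked once.
    """
    flagged = {os.path.dirname(f) for f in files if os.path.basename(f) == marker}

    def under_flagged(path):
        parent = os.path.dirname(path)
        while True:
            if parent in flagged:
                return True
            if not parent:
                return False  # reached the sync root without a match
            parent = os.path.dirname(parent)

    return [f for f in files if not under_flagged(f)]


files = ["keep/a.txt", "skip/.nosync", "skip/b.txt", "skip/deep/c.txt"]
print(filter_exclude_if_present(files, ".nosync"))  # ['keep/a.txt']
```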
Symlinks
First, note that all directory links are followed regardless of this setting. Use exclusions to avoid syncing a linked directory.
If copy_symlinks_as_links=False, symlinked files sync their referent (and rsync uses -L). If True (default), symlinks copy the link itself (a la how git works).
WARNINGS:
- If copy_symlinks_as_links = False and there are symlinked files pointing to another file IN the sync root, there will be issues with the file tracking. Do not do this!
- As also noted in Python's documentation, there is no safeguard against recursively symlinked directories.
- rsync may throw warnings for broken links
- rclone's support of symlinks is unreliable at the moment.
Pre and Post Bash
There is also the option to run bash scripts before and after the sync. These may be useful if you wish to do a git push, pull, etc., either remote or local.
They are ALWAYS executed from the sync root (a cd /path/to/syncroot is inserted at the top of the script).
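The `cd` prefix can be pictured with the following Python 3 sketch. Both function names are hypothetical, and the quoting of the sync root via `shlex.quote` is an assumption added for safety in the example; it is not a statement about how PyFiSync builds its scripts.

```python
import shlex
import subprocess


def build_script(sync_root, user_script):
    """Prefix the user's bash snippet with a cd into the sync root."""
    return "cd {}\n{}".format(shlex.quote(sync_root), user_script)


def run_bash(sync_root, user_script):
    """Run the combined script with bash and return its combined output."""
    proc = subprocess.run(
        ["bash", "-c", build_script(sync_root, user_script)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return proc.stdout


# Only the script text is shown here; run_bash would execute it
print(build_script("/path/to/syncroot", "git pull"))
```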
Running Tests
To run the tests, in bash, do:
$ source run_test.sh
In addition to testing a whole slew of edge cases, it will also test all actions on a local sync, and on a remote sync to both python2 and python3 (via ssh localhost). The run script will try to call py.test for both versions of Python locally.
Known Issues and Limitations
The test suite is extensive so as to cover many different and difficult scenarios. See the tests for further exploration of how the code handles these cases. Please note that unless specified explicitly in the config or by the command-line flag, all deletions and (future) overwrites first perform a backup. Moves are not backed up but may likely be unwound from the logs.
A few notable limitations are as follows:
- Symlinks are followed (optionally) but if the file they are linking to is also in the sync folder, it may confuse the move tracking
- File move tracking
  - A file moved with a new name that is excluded will propagate as deleted. This is expected since the code no longer has a way to "see" the file on the one side.
  - A file that is moved on one side and deleted on the other will NOT have the deletion propagated regardless of modification
- Sync is based on modification time metadata. This is fairly robust but could still have issues. In rsync mode, even if PyFiSync decides to sync the files, it may just update the metadata. In that case, you may just want to disable backups. With rclone, it depends on the remote and care should be taken.
There is also a potential issue with the test suite. To ensure that files are noted as changed (since they are all modified so quickly), the times are often adjusted by random amounts. There is a small chance that some tests fail because a random adjustment was too small to register as a change. Running the tests again should pass.
See rclone readme for some rclone-related known issues.
Other Questions
See the (growing) FAQ for more details and/or troubleshooting.