ISCB-Asia/SCCG 2012 Proceedings Talk

Fast Probabilistic File Fingerprinting for Big Data

Konstantin Tretjakov1, Sven Laur1, Geert Smant2, Jaak Vilo1 & Pjotr Prins2,3
Results: We present an efficient method for calculating file uniqueness for large scientific data files that requires less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, its performance is essentially flat, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as its limitations.
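The core idea of sampling-based fingerprinting can be illustrated with a minimal sketch. This is not the authors' PFFF implementation; the function name, sample count, chunk size, and choice of SHA-1 are all illustrative assumptions. It hashes a fixed set of pseudo-randomly chosen chunks (plus the file length), so the cost stays roughly constant regardless of file size:

```python
import hashlib
import os
import random

def sample_fingerprint(path, n_samples=64, chunk_size=64, seed=2012):
    """Fingerprint a file by hashing pseudo-randomly sampled chunks.

    Illustrative sketch, not the published PFFF algorithm. A fixed seed
    makes sample offsets reproducible, so two files of equal size are
    probed at the same positions and their fingerprints are comparable.
    """
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(str(size).encode())  # fold the file length into the fingerprint
    rng = random.Random(seed)
    with open(path, "rb") as f:
        if size <= n_samples * chunk_size:
            h.update(f.read())  # small file: cheaper to hash it whole
        else:
            for _ in range(n_samples):
                # Sample a chunk at a pseudo-random (but reproducible) offset.
                offset = rng.randrange(size - chunk_size)
                f.seek(offset)
                h.update(f.read(chunk_size))
    return h.hexdigest()
```

As the abstract notes, reliability rests on the variation present in the data: two files that differ only in a region the sampler never touches would collide, which is why the collision risk is probabilistic rather than zero.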
Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The algorithm is implemented in an open-source tool named pfff, available both as a command-line tool and as a C library.