Comparing Two Files After File Copy - Performance Improvements?












I've built a file copying routine into a common library shared by several (WinForms) applications I'm currently working on. It calls the commonly-used CopyFileEx function to perform the copy while displaying progress, and that part seems to be working great.



The only real issue I'm encountering is that, because most of the file copying I'm doing is for archival purposes, once the file is copied, I would like to "verify" the new copy of the file. I have the following methods in place to do the comparison/verification. I'm sure many of you will quickly see where the "problem" is:



Public Shared Function CompareFiles(ByVal File1 As IO.FileInfo, ByVal File2 As IO.FileInfo) As Boolean
    Dim Match As Boolean = False

    If File1.FullName = File2.FullName Then
        Match = True
    Else
        If File.Exists(File1.FullName) AndAlso File.Exists(File2.FullName) Then
            If File1.Length = File2.Length Then
                If File1.LastWriteTime = File2.LastWriteTime Then
                    Try
                        Dim File1Hash As String = HashFileForComparison(File1)
                        Dim File2Hash As String = HashFileForComparison(File2)

                        If File1Hash = File2Hash Then
                            Match = True
                        End If
                    Catch ex As Exception
                        Dim CompareError As New ErrorHandler(ex)

                        CompareError.LogException()
                    End Try
                End If
            End If
        End If
    End If

    Return Match
End Function

Private Shared Function HashFileForComparison(ByVal OriginalFile As IO.FileInfo) As String
    Using BufferedFileReader As New IO.BufferedStream(File.OpenRead(OriginalFile.FullName), 1200000)
        Using MD5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Dim FileHash As Byte() = MD5.ComputeHash(BufferedFileReader)

            Return System.Text.Encoding.Unicode.GetString(FileHash)
        End Using
    End Using
End Function


This CompareFiles() method checks a few of the "simple" elements first:




  • Is it trying to compare a file to itself? (if so, always return True)

  • Do both files actually exist?

  • Are the two files the same size?

  • Do they both have the same modification date?


But, you guessed it, this is where performance takes the hit. For large files, the MD5.ComputeHash call in HashFileForComparison() can take a while: about 1.25 minutes for a 500MB file, so roughly 2.5 minutes to compute both hashes for one comparison. Does anyone have a suggestion for verifying the new copy more efficiently?










  • Why not compare the files directly instead of first computing hashes and then comparing those? For only two files, you’re not saving any work. It also makes no sense to convert the MD5 hash to a string before comparing it. Apart from that you can make your code more readable by rewriting it to exit the function as soon as possible, rather than nesting your If statements. And lastly, writing If ‹condition› then Variable = True is an anti-pattern. Just write Variable = ‹condition›.

    – Konrad Rudolph
    Nov 13 '18 at 17:14
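
A minimal sketch of the restructuring suggested in this comment, assuming the same four pre-checks as the question's code. The module name FileCompareSketch and the helper HashFile are hypothetical, and the question's ErrorHandler logging is left out so the block stands alone, which means I/O exceptions propagate to the caller here. It keeps the MD5 step but compares the digest bytes directly with SequenceEqual rather than converting them to a string first.

```vbnet
Imports System
Imports System.IO
Imports System.Linq
Imports System.Security.Cryptography

' Hypothetical module name; not part of the question's library.
Public Module FileCompareSketch

    ' Guard-clause version of CompareFiles: each cheap check returns
    ' immediately instead of adding another level of nesting.
    Public Function CompareFiles(ByVal File1 As FileInfo, ByVal File2 As FileInfo) As Boolean
        If File1.FullName = File2.FullName Then Return True
        If Not (File.Exists(File1.FullName) AndAlso File.Exists(File2.FullName)) Then Return False
        If File1.Length <> File2.Length Then Return False
        If File1.LastWriteTime <> File2.LastWriteTime Then Return False

        ' Compare the raw 16-byte digests; no string conversion is involved.
        Return HashFile(File1).SequenceEqual(HashFile(File2))
    End Function

    ' Hypothetical helper: returns the MD5 digest of the file as bytes.
    Private Function HashFile(ByVal SourceFile As FileInfo) As Byte()
        Using Reader As FileStream = File.OpenRead(SourceFile.FullName)
            Using Hasher As MD5 = MD5.Create()
                Return Hasher.ComputeHash(Reader)
            End Using
        End Using
    End Function

End Module
```

The final Return line also shows the "Variable = ‹condition›" idea: the result of the comparison is returned directly rather than being assigned inside an If block.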













  • To be clear, is your suggestion to eliminate the hash comparison entirely? My main intention is to do my best to ensure that the "archive" copy is accurate in case I need to restore it later. Comparing the file size and modification date should generally indicate that the files are "the same", but I want to leave out as much room for error as possible.

    – G_Hosa_Phat
    Nov 13 '18 at 17:20






  • One of my suggestions is to remove the hash comparison, yes. It’s a detour. Just open both files and compare their contents in chunks. That way you avoid the (somewhat costly) hash computation.

    – Konrad Rudolph
    Nov 13 '18 at 17:29
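
One way the chunked comparison described in this comment might look. This is a sketch rather than a tuned implementation: ChunkCompareSketch, ContentsEqual, FillBuffer and the 1 MB ChunkSize are illustrative assumptions, not names or values from the question.

```vbnet
Imports System
Imports System.IO

' Hypothetical module name; not part of the question's library.
Public Module ChunkCompareSketch

    ' Assumed buffer size; the best value depends on the storage involved.
    Private Const ChunkSize As Integer = 1024 * 1024

    ' Returns True when both files contain exactly the same bytes.
    Public Function ContentsEqual(ByVal Path1 As String, ByVal Path2 As String) As Boolean
        Using Stream1 As New FileStream(Path1, FileMode.Open, FileAccess.Read, FileShare.Read, ChunkSize, FileOptions.SequentialScan)
            Using Stream2 As New FileStream(Path2, FileMode.Open, FileAccess.Read, FileShare.Read, ChunkSize, FileOptions.SequentialScan)
                If Stream1.Length <> Stream2.Length Then Return False

                Dim Buffer1(ChunkSize - 1) As Byte
                Dim Buffer2(ChunkSize - 1) As Byte
                Dim Count1 As Integer

                Do
                    Count1 = FillBuffer(Stream1, Buffer1)
                    Dim Count2 As Integer = FillBuffer(Stream2, Buffer2)

                    If Count1 <> Count2 Then Return False

                    ' Stop at the first differing byte.
                    For Index As Integer = 0 To Count1 - 1
                        If Buffer1(Index) <> Buffer2(Index) Then Return False
                    Next
                Loop While Count1 > 0

                Return True
            End Using
        End Using
    End Function

    ' Stream.Read may return fewer bytes than requested, so keep reading
    ' until the buffer is full or the end of the stream is reached.
    Private Function FillBuffer(ByVal Source As Stream, ByVal Target As Byte()) As Integer
        Dim Total As Integer = 0

        Do While Total < Target.Length
            Dim BytesRead As Integer = Source.Read(Target, Total, Target.Length - Total)
            If BytesRead = 0 Then Exit Do
            Total += BytesRead
        Loop

        Return Total
    End Function

End Module
```

Because the loop returns at the first mismatch, a corrupted copy can be rejected without reading the rest of either file. Identical files still have to be read in full from both locations, just as with hashing.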













  • I've tried using a couple of the methods suggested in the thread below but it seems that my current MD5 implementation is actually outperforming them. Even when I "tweak" the buffer sizes and such, the hash method I'm using is still a few seconds (or more) faster with this 500MB file. stackoverflow.com/questions/1358510/…

    – G_Hosa_Phat
    Nov 13 '18 at 20:45
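
Timings like the ones mentioned in this comment could be collected with a small Stopwatch wrapper such as the sketch below. TimingSketch and TimeIt are hypothetical names. Note that the operating system may serve a second read of the same file from its cache, so the order in which the methods are run can skew the numbers.

```vbnet
Imports System
Imports System.Diagnostics

' Hypothetical module name; not part of the question's library.
Public Module TimingSketch

    ' Runs a comparison delegate once and reports how long it took.
    Public Function TimeIt(ByVal Label As String, ByVal Work As Func(Of Boolean)) As Boolean
        Dim Watch As Stopwatch = Stopwatch.StartNew()
        Dim Result As Boolean = Work()
        Watch.Stop()

        Console.WriteLine("{0}: result {1} in {2} ms", Label, Result, Watch.ElapsedMilliseconds)
        Return Result
    End Function

    ' Example call, assuming a CompareFiles function and two FileInfo
    ' objects (Original, Copy) are in scope:
    '     TimeIt("MD5", Function() CompareFiles(Original, Copy))

End Module
```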













  • I may be missing something here, but your function HashFileForComparison has the code Using BufferedFileReader .., but you don't use BufferedFileReader you refer to FileReader in your sample - is this a typo in the pasted sample or a mistake in your code?

    – David Wilson
    Nov 15 '18 at 11:29
















vb.net performance md5






edited Nov 15 '18 at 14:29







G_Hosa_Phat

















asked Nov 13 '18 at 17:12









G_Hosa_Phat