vb.net Is there a faster way to substract one list of string from another
Public done As New List(Of String) Public thinkingofdoing As New List(Of String) Public todo As New List(Of String) done.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt")) thinkingofdoing.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt")) For i = 0 To thinkingofdoing.Count - 1 ThreadPool.QueueUserWorkItem(AddressOf caldiff, thinkingofdoing(i)) Next Public Sub caldiff(ByVal tobedone) If done.Contains(tobedone) = False Then todo.Add(tobedone) End If End Sub
done.txt and thinkingofdoing.txt have anywhere from 5 million to 8 million lines
It's taking very long :(, even with a quad core AMD 965 overclocked to 4.2 GHZ.
First off, the above code is not valid. List(Of T) is not thread safe, so doing this from multiple threads will actually cause significant problems without synchronization, as the calls to Add and Contains aren't themselves safe to be called from multiple threads.
A better option would be to choose better collections, such as HashSet(Of T), which would cause the checks to be far faster. I would recommend something like:
public Done as New HashSet(Of String) public ThinkingOfDoing as IList(Of String) public Todo as New List(Of String) ThinkingOfDoing = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt") Done.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt")) ToDo = ThinkingOfDoing.Where(Function(i) Done.Contains(i) = False).ToList()
By using a HashSet(Of T), the Contains() check will become far faster (O(1) instead of O(n)), which will cause this to run a lot faster, even single threaded.
If you don't need to store Done, you could just keep the array, and use Enumerable.Except directly (which uses a Set internally):
ThinkingOfDoing = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt") Dim done = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt") Dim Todo = ThinkingOfDoing.Except(done).ToList();
You can use Enumerable.Except which should be much more efficient because it's implemented asHashSet<T> :
IEnumerable(Of String) newLines = thinkingofdoing.Except(done)
You should also use File.ReadLines instead of File.ReadAllLines since the former uses streams whereas the latter loads all into memory at once.
I would test the performance without using ThreadPool first.
how about this ...
Public done As ISet(Of String) Public toDo As New List(Of String)(); done = New HashSet(Of String) _ (System.IO.File.ReadAllLine("C:\Users\Work\Desktop\done.txt") Using reader As New StreamReader(New FileStream _ ("C:\Users\Work\Desktop\thinkingofdoing.txt"), FileMode.Open) Do While reader.Peek() >= 0 Dim line = reader.ReadLine() If Not done.Contains(line) Then toDo.Add(line) EndIf Loop End Using
This loads all the done lines into a HashSet which has excellent lookup performance, then instead of loading the whole of the thinking of doing file into memory it gets parsed line by line and only added to todo if it is not already in done.
If VB.Net had a yield return I would have put that in a function and done ToList on the IEnumerable but hey ho.