vb.net Is there a faster way to substract one list of string from another

Public done As New List(Of String)
Public thinkingofdoing As New List(Of String)
Public todo As New List(Of String)

done.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt"))
thinkingofdoing.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt"))

For i = 0 To thinkingofdoing.Count - 1
    ThreadPool.QueueUserWorkItem(AddressOf caldiff, thinkingofdoing(i))
Next

Public Sub caldiff(ByVal tobedone)
    If done.Contains(tobedone) = False Then
        todo.Add(tobedone)
    End If
End Sub

done.txt and thinkingofdoing.txt have anywhere from 5 million to 8 million lines

It's taking very long :(, even with a quad core AMD 965 overclocked to 4.2 GHZ.

Answers


First off, the above code is not valid. List(Of T) is not thread safe, so doing this from multiple threads will actually cause significant problems without synchronization, as the calls to Add and Contains aren't themselves safe to be called from multiple threads.

A better option would be to choose better collections, such as HashSet(Of T), which would cause the checks to be far faster. I would recommend something like:

public Done as New HashSet(Of String)
public ThinkingOfDoing as IList(Of String) 
public Todo as New List(Of String)

ThinkingOfDoing = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt")
Done.AddRange(System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt"))

ToDo = ThinkingOfDoing.Where(Function(i) Done.Contains(i) = False).ToList()

By using a HashSet(Of T), the Contains() check will become far faster (O(1) instead of O(n)), which will cause this to run a lot faster, even single threaded.

If you don't need to store Done, you could just keep the array, and use Enumerable.Except directly (which uses a Set internally):

ThinkingOfDoing = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\thinkingofdoing.txt")
Dim done = System.IO.File.ReadAllLines("C:\Users\Work\Desktop\done.txt")

Dim Todo = ThinkingOfDoing.Except(done).ToList();

You can use Enumerable.Except which should be much more efficient because it's implemented asHashSet<T> :

IEnumerable(Of String) newLines = thinkingofdoing.Except(done)

You should also use File.ReadLines instead of File.ReadAllLines since the former uses streams whereas the latter loads all into memory at once.

I would test the performance without using ThreadPool first.


how about this ...

Public done As ISet(Of String) 
Public toDo As New List(Of String)(); 

done = New HashSet(Of String) _
    (System.IO.File.ReadAllLine("C:\Users\Work\Desktop\done.txt")

Using reader As New StreamReader(New FileStream _
        ("C:\Users\Work\Desktop\thinkingofdoing.txt"), FileMode.Open)
    Do While reader.Peek() >= 0
        Dim line = reader.ReadLine()
        If Not done.Contains(line) Then
            toDo.Add(line)
        EndIf
    Loop
End Using

This loads all the done lines into a HashSet which has excellent lookup performance, then instead of loading the whole of the thinking of doing file into memory it gets parsed line by line and only added to todo if it is not already in done.

If VB.Net had a yield return I would have put that in a function and done ToList on the IEnumerable but hey ho.


Need Your Help

AjaxFileUpload fails on UploadComplete

asp.net vb.net asp.net-ajax

I'm having trouble trying to get the AjaxFileUpload to complete properly. Here's what happens:

What is wrong with this Random String Generator?

php arrays string random generator

I have edited some code that I found online to generate a random string in PHP which I am going to use as an ID.