Excel - String Remove Duplicates
I am working with some UK address data which within an Excel cell is split into its constituent parts by a comma.
I have some VBA which I've taken from the web which has removed a number of exact duplicated entries but I am left with a large amount of data which has repeating segments some sequentially and some non sequentially.
Attached is an image highlighting what I am trying to achieve, the code I have used thus far which is not mine is included to show you the direction in which I have been looking. Anyone have any further thoughts on how this can be achieved?
Function stringOfUniques(inputString As String, delimiter As String) Dim xVal As Variant Dim dict As Object Set dict = CreateObject("Scripting.Dictionary") For Each xVal In Split(inputString, delimiter) dict(xVal) = xVal Next xVal stringOfUniques = Join(dict.Keys(), ",") End Function
This did manage to get rid of a number of them but there is a huge population that I am working on so automating this would be incredible.
Possibly not the most elegant answer, but this does the trick. Here I use the Split command to split the string at each comma. The result returned from this is
bat ball banana
Option Explicit Private Sub test() Dim Mystring As String Dim StrResult As String Mystring = "bat,ball,bat,ball,banana" StrResult = shed_duplicates(Mystring) End Sub Private Function shed_duplicates(ByRef Mystring As String) As String Dim MySplitz() As String Dim J As Integer Dim K As Integer Dim BooMatch As Boolean Dim StrTemp(10) As String ' assumes no more than 10 possible splits! Dim StrResult As String MySplitz = Split(Mystring, ",") For J = 0 To UBound(MySplitz) BooMatch = False For K = 0 To UBound(StrTemp) If MySplitz(J) = StrTemp(K) Then BooMatch = True Exit For End If Next K If Not BooMatch Then StrTemp(J) = MySplitz(J) End If Next For J = 0 To UBound(StrTemp) If Len(StrTemp(J)) > 0 Then ' ignore blank entries StrResult = StrResult + StrTemp(J) + " " End If Next J Debug.Print StrResult End Function
You may really use a regex replacement:
The replacement pattern is
See the regex demo. The pattern explanation:
- ^ - start of a string (or of a line if .MultiLine = True)
- (\d*\s*([^,]*),.*) - Group 1 (later referenced to with $1 backreference from the replacement pattern) matching:
- \d* - 0+ digits followed with
- \s* - 0+ whitespace characters
- ([^,]*) - Group 2 (later we can use \2 in-pattern backreference to refer to the value captured with this subpattern) matching 0+ characters other than a comma
- ,.* - a comma followed with 0+ characters other than a newline
- \2 - the text captured by Group 2
- (,|$) - Group 3 (later referenced to with $3 from the replacement pattern - to restore the comma) matching either a comma or the end of string (or line if .MultiLine = True).
NOTE: You do not need .MultiLine = True if you just check individual cells with containing one address.
Below is a sample VBA Sub showing how this can be used in VBA:
Sub test() Dim regEx As Object Set regEx = CreateObject("VBScript.RegExp") With regEx .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)" .Global = True .MultiLine = True ' Remove if individual addresses are matched End With s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _ "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _ "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _ "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD" MsgBox regEx.Replace(s, "$1$3") End Sub
A first solution would be to use a dictionary to get a list of unique segments. It would then be as simple as skipping the first address number before splitting the segments:
Function RemoveDuplicates1(text As String) As String Static dict As Object If dict Is Nothing Then Set dict = CreateObject("Scripting.Dictionary") dict.CompareMode = 1 ' set the case sensitivity to All Else dict.RemoveAll End If ' Get the position just after the address number Dim c&, istart&, segment For istart = 1 To Len(text) c = Asc(Mid$(text, istart, 1)) If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] Next ' Split the segments and add each one of them to the dictionary. No need to keep ' a reference to each segment since the keys are returned by order of insertion. For Each segment In Split(Mid$(text, istart), ",") If Len(segment) Then dict(segment) = Empty Next ' Return the address number and the segments by joining the keys RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",") End Function
A second solution would be to extract all the segments and then search if each one of them is present at a previous position:
Function RemoveDuplicates2(text As String) As String Dim c&, segments$, segment$, length&, ifirst&, istart&, iend& ' Get the position just after the address number For ifirst = 1 To Len(text) c = Asc(Mid$(text, ifirst, 1)) If (c < 48 Or c > 57) And c <> 32 Then Exit For ' if not [0-9 ] Next ' Get the segments without the address number and add a leading/trailing comma segments = "," & Mid$(text, ifirst) & "," istart = 1 ' iterate each segment Do While istart < Len(segments) ' Get the next segment position iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF If iend - istart Then ' Get the segment segment = Mid$(segments, istart, iend - istart + 2) ' Rewrite the segment if not present at a previous position If InStr(1, segments, segment, vbTextCompare) = istart Then Mid$(segments, length + 1) = segment length = length + Len(segment) - 1 End If End If istart = iend + 1 Loop ' Return the address number and the segments RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1) End Function
And a third solution would be to use a regular expression to remove all the duplicated segments:
Function RemoveDuplicates3(ByVal text As String) As String Static re As Object If re Is Nothing Then Set re = CreateObject("VBScript.RegExp") re.Global = True re.IgnoreCase = True ' Match any duplicated segment separated by a comma. ' The first segment is compared without the first digits. re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)" End If ' Remove each matching segment Do While re.test(text) text = re.Replace(text, "$1") Loop RemoveDuplicates3 = text End Function
These are the execution times for 10000 iterations (the lower the better):
input text : "123 abc,,1 abc,abc 2,ABC,abc,a,c" output text : "123 abc,1 abc,abc 2,a,c" RemoveDuplicates1 (dictionary) : 718 ms RemoveDuplicates2 (text search) : 219 ms RemoveDuplicates3 (regex) : 1469 ms