Filter only digit sequences containing a given set of digits

I have a large list of digit strings like this one. The individual strings are relatively short (say less than 50 digits).

data = [
  '300303334',
  '53210234',
  '123456789',
  '5374576807063874'
]

I need to find out a efficient data structure (speed first, memory second) and algorithm which returns only those strings that are composed of a given set of digits.

Example results:

filter(data, [0,3,4]) = ['300303334']
filter(data, [0,1,2,3,4,5]) = ['300303334', '53210234']

The data list will usually fit into memory.

Answers


For each digit, precompute a postings list that don't contain the digit.

postings = [[] for _ in xrange(10)]
for i, d in enumerate(data):
    for j in xrange(10):
        digit = str(j)
        if digit not in d:
            postings[j].append(i)

Now, to find all strings that contain, for example, just the digits [1, 3, 5] you can merge the postings lists for the other digits (ie: 0, 2, 4, 6, 7, 8, 9).

def intersect_postings(p0, p1):
    i0, i1 = next(p0), next(p1)
    while True:
         if i0 == i1:
            yield i0
            i0, i1 = next(p0), next(p1)
         elif i0 < i1: i0 = next(p0)
         else: i1 = next(p1)

def find_all(digits):
    p = None
    for d in xrange(10):
        if d not in digits:
            if p is None: p = iter(postings[d])
            else: p = intersect_postings(p, iter(postings[d]))
    return (data[i] for i in p) if p else iter(data)

print list(find_all([0, 3, 4]))
print list(find_all([0, 1, 2, 3, 4, 5]))

A string can be encoded by a 10-bit number. There are 2^10, or 1,024 possible values.

So create a dictionary that uses an integer for a key and a list of strings for the value.

Calculate the value for each string and add that string to the list of strings for that value.

General idea:

Dictionary Lookup;

for each (string in list)
    value = 0;
    for each character in string
        set bit N in value, where N is the character (0-9)
    Lookup[value] += string  // adds string to list for this value in dictionary

Then, to get a list of the strings that match your criteria, just compute the value and do a direct dictionary lookup.

So if the user asks for strings that contain only 3, 5, and 7:

value = (1 << 3) || (1 << 5) || (1 << 7);
list = Lookup[value];

Note that, as Matt pointed out in comment below, this will only return strings that contain all three digits. So, for example, it wouldn't return 37. That seems like a fatal flaw to me.

Edit

If the number of symbols you have to deal with is very large, then the number of possible combinations becomes too large for this solution to be practical.

With a large number of symbols, I'd recommend an inverted index as suggested in the comments, combined with a secondary filter that removes the strings that contain extraneous digits.


Consider a function f which constructs a bitmask for each string with bit i set if digit i is in the string.

For example,

f('0')    = 0b0000000001
f('00')   = 0b0000000001
f('1')    = 0b0000000010
f('1100') = 0b0000000011

Then I suggest storing a list of strings for each bitmask.

For example,

Bitmask 0b0000000001 -> ['0','00']

Once you have prepared this data structure (which is the same size as your original list), you can then easily access all the strings for a particular filter by accessing all lists where the bitmask is a subset of the digits in your filter.

So for your example of filter [0,3,4] you would return the lists from:

Strings containing just 0
Strings containing just 3
Strings containing just 4
Strings containing 0 and 3
Strings containing 0 and 4
Strings containing 3 and 4
Strings containing 0 and 3 and 4
Example Python Code
from collections import defaultdict
import itertools

raw_data = [
  '300303334',
  '53210234',
  '123456789',
  '5374576807063874'
]

def preprocess(raw_data):
    data = defaultdict(list)
    for s in raw_data:
        bitmask = 0
        for digit in s:
            bitmask |= 1<<int(digit)
        data[bitmask].append(s)
    return data

def filter(data,mask):
    for r in range(len(mask)):
        for m in itertools.combinations(mask,r+1):
            bitmask = sum(1<<digit for digit in m)
            for s in data[bitmask]:
                yield s

data = preprocess(raw_data)

for a in filter(data, [0,1,2,3,4,5]):
    print a

Just for kicks, I have coded up Jim's lovely algorithm and the Perl is here if anyone wants to play with it. Please do not accept this as an answer or anything, pass all credit to Jim:

#!/usr/bin/perl
use strict;
use warnings;

my $Debug=1;
my $Nwords=1000;

my ($word,$N,$value,$i,$j,$k);
my (@dictionary,%Lookup);

################################################################################
# Generate "words" with random number of characters 5-30
################################################################################
print "DEBUG: Generating $Nwords word dictionary\n" if $Debug;
for($i=0;$i<$Nwords;$i++){
   $j = rand(25) + 5;   # length of this word
   $word="";
   for($k=0;$k<$j;$k++){
      $word = $word . int(rand(10));
   }
   $dictionary[$i]=$word;
   print "$word\n" if $Debug;
}

# Add some obvious test cases
$dictionary[++$i]="0" x 50;
$dictionary[++$i]="1" x 50;
$dictionary[++$i]="2" x 50;
$dictionary[++$i]="3" x 50;
$dictionary[++$i]="4" x 50;
$dictionary[++$i]="5" x 50;
$dictionary[++$i]="6" x 50;
$dictionary[++$i]="7" x 50;
$dictionary[++$i]="8" x 50;
$dictionary[++$i]="9" x 50;
$dictionary[++$i]="0123456789";

################################################################################
# Encode words
################################################################################
for $word (@dictionary){
   $value=0;
   for($i=0;$i<length($word);$i++){
      $N=substr($word,$i,1);
      $value |= 1 << $N;
   }
   push(@{$Lookup{$value}},$word);
   print "DEBUG: $word encoded as $value\n" if $Debug;
}

################################################################################
# Do lookups
################################################################################
   while(1){
      print "Enter permitted digits, separated with commas: ";
      my $line=<STDIN>;
      my @digits=split(",",$line);
      $value=0;
      for my $d (@digits){
        $value |= 1<<$d;
      }
      print "Value: $value\n";
      print join(", ",@{$Lookup{$value}}),"\n\n" if defined $Lookup{$value};
   }

I like Jim Mischel's approach. It has pretty efficient look up and bounded memory usage. Code in C follows:

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <readline/readline.h>
#include <readline/history.h>

enum {
    zero = '0',
    nine = '9',
    numbers = nine - zero + 1,
    masks = 1 << numbers,
};

typedef uint16_t mask;

struct list {
    char *s;
    struct list *next;
};

typedef struct list list_cell;
typedef struct list *list;

static inline int is_digit(char c) { return c >= zero && c <= nine; }
static inline mask char2mask(char c) { return 1 << (c - zero); }

static inline mask add_char2mask(mask m, char c) {
    return m | (is_digit(c) ? char2mask(c) : 0);
}

static inline int is_set(mask m, mask n) { return (m & n) != 0; }

static inline int is_set_char(mask m, char c) { return is_set(m, char2mask(c)); }

static inline int is_submask(mask sub, mask m) { return (sub & m) == sub; }

static inline char *sprint_mask(char buf[11], mask m) {
    char *s = buf;
    char i;
    for(i = zero; i <= nine; i++)
        if(is_set_char(m, i)) *s++ = i;
    *s = 0;
    return buf;
}

static inline mask get_mask(char *s) {
    mask m=0;
    for(; *s; s++)
        m = add_char2mask(m, *s);
    return m;
}

static inline int is_empty(list l) { return !l; }

static inline list insert(list *l, char *s) {
    list cell = (list)malloc(sizeof(list_cell));
    cell->s = s;
    cell->next = *l;
    return *l = cell;
}

static void *foreach(void *f(char *, void *), list l, void *init) {
    for(; !is_empty(l); l = l->next)
        init = f(l->s, init);
    return init;
}

struct printer_state {
    int first;
    FILE *f;
};

static void *prin_list_member(char *s, void *data) {
    struct printer_state *st = (struct printer_state *)data;
    if(st->first) {
        fputs(", ", st->f);
    } else
        st->first = 1;
    fputs(s, st->f);
    return data;
}

static void print_list(list l) {
    struct printer_state st = {.first = 0, .f = stdout};
    foreach(prin_list_member, l, (void *)&st);
    putchar('\n');
}

static list *init_lu(void) { return (list *)calloc(sizeof(list), masks); }

static list *insert2lu(list lu[masks], char *s) {
    mask i, m = get_mask(s);
    if(m)       // skip string without any number
        for(i = m; i < masks; i++)
            if(is_submask(m, i))
                insert(lu+i, s);
    return lu;
}

int usage(const char *name) {
    fprintf(stderr, "Usage: %s filename\n", name);
    return EXIT_FAILURE;
}

#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

static inline void chomp(char *s) { if( (s = strchr(s, '\n')) ) *s = '\0'; }

list *load_file(FILE *f) {
    char *line = NULL;
    size_t len = 0;
    ssize_t read;
    list *lu = init_lu();

    for(; (read = getline(&line, &len, f)) != -1; line = NULL) {
        chomp(line);
        insert2lu(lu, line);
    }

    return lu;
}

void read_reqs(list *lu) {
    char *line;
    char buf[11];

    for(; (line = readline("> ")); free(line))
        if(*line) {
            add_history(line);
            mask m = get_mask(line);
            printf("mask: %s\nstrings: ", sprint_mask(buf, m));
            print_list(lu[m]);
        };

    putchar('\n');
}

int main(int argc, const char* argv[] ) {
    const char *name = argv[0];
    FILE *f;
    list *lu;

    if(argc != 2) return usage(name);

    f = fopen(argv[1], "r");
    if(!f) handle_error("open");
    lu = load_file(f);
    fclose(f);

    read_reqs(lu);

    return EXIT_SUCCESS;
}

To compile use

gcc -lreadline -o digitfilter digitfilter.c

And test run:

$ cat data.txt 
300303334
53210234
123456789
5374576807063874
$ ./digitfilter data.txt 
> 034
mask: 034
strings: 300303334
> 0,1,2,3,4,5
mask: 012345
strings: 53210234, 300303334
> 0345678
mask: 0345678
strings: 5374576807063874, 300303334

Put each value into a set-- Eg.: '300303334'={3, 0, 4}.

Since the length of your data items are bound by a constant (50), you can do these at O(1) time for each item using Java HashSet. The overall complexity of this phase adds up to O(n).

For each filter set, use containsAll() of HashSet to see whether each of these data items is a subset of your filter. Takes O(n).

Takes O(m*n) in the overall where n is the number of data items and m the number of filters.


Need Your Help

tablelayout delete tablerow in android

android tablelayout tablerow

final TableLayout table = (TableLayout) findViewById(R.id.tableLayout);

AngularJs file downloading WebAPI

angularjs asp.net-web-api

I need to download a file from server using Web API and angularJs. I am using below code for in API controller. When i hit the API through browser i can able to download the file. i don't have no i...