Python unicode search not giving correct answer

I am trying to search for Hindi words, stored one per line in file-1, within the lines of file-2. I have to print the line numbers along with the number of words found on each line. This is the code:

import codecs

hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8").readlines()
words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()
count_arr = []

for counter, line in enumerate(hypernyms):
    count_arr.append(0)
    for word in words:
        if line.find(word) >=0:
            count_arr[counter] +=1

for iterator, count in enumerate(count_arr):
    if count>0:
        print iterator, ' ', count

This finds some words but ignores others. The input files are:

File-1:

पौधा  
वनस्पति

File-2:

वनस्पति, पेड़-पौधा  
वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग  
पादप_समूह, पेड़-पौधे, वनस्पति_समूह  
पेड़-पौधा

This gives output:

0 1  
3 1

Clearly, it is ignoring वनस्पति and searching for पौधा only. I have tried with other inputs as well. It only searches for one word. Any idea how to correct this?

Answers


That's because you don't remove the "\n" character at the end of each line. So you are searching for "some_pattern\n", not "some_pattern". Use the strip() function to chop it off, like this:

import codecs

words = [word.strip() for word in codecs.open("hypernyms_en2hi.txt", "r", "utf-8")]
hypernyms = codecs.open("hindi_hypernym.txt", "r", "utf-8")
count_arr = []

for line in hypernyms:
    count_arr.append(0)
    for word in words:
        count_arr[-1] += (word in line)

for iterator, count in enumerate(count_arr):
    if count:
        print iterator, ' ', count
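For anyone on Python 3, the same strip-then-search idea can be sketched as below. This is only a minimal sketch: the function name count_word_hits is made up for illustration, and the sample lists stand in for the two files from the question (in real use they would come from something like open(path, encoding="utf-8")).

```python
def count_word_hits(lines, words):
    """Return (line_number, hits) for every line that contains at least one word."""
    # Strip newlines and surrounding spaces from each word; skip blank entries.
    cleaned = [w.strip() for w in words if w.strip()]
    results = []
    for number, line in enumerate(lines):
        # Count how many of the cleaned words occur as substrings of this line.
        hits = sum(1 for w in cleaned if w in line)
        if hits:
            results.append((number, hits))
    return results


# Sample data copied from the question; the trailing "\n" mimics readlines().
words = ["पौधा\n", "वनस्पति\n"]
lines = [
    "वनस्पति, पेड़-पौधा\n",
    "वस्तु-भाग, वस्तु-अंग, वस्तु_भाग, वस्तु_अंग\n",
    "पादप_समूह, पेड़-पौधे, वनस्पति_समूह\n",
    "पेड़-पौधा\n",
]

for number, hits in count_word_hits(lines, words):
    print(number, hits)
```

Note that this is still a plain substring test, so a word also counts when it appears inside a longer token such as वनस्पति_समूह.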

I think the problem is here:

words = codecs.open("hypernyms_en2hi.txt", "r", "utf-8").readlines()

.readlines() will leave the line break at the end, so you're not searching for पौधा, you're searching for पौधा\n, and you'll only match at the end of a line. If I use .read().split() instead, I get:

0   2
2   1
3   1
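The difference between the two ways of reading can be checked without touching the real files. The snippet below is a rough Python 3 illustration that uses io.StringIO as a stand-in for the word file; the sample text is copied from the question and the variable names are only for this example.

```python
import io

# In-memory stand-in for the word file (one word per line).
sample = "पौधा\nवनस्पति\n"

with_newlines = io.StringIO(sample).readlines()        # each item keeps its "\n"
without_newlines = io.StringIO(sample).read().split()  # split on whitespace, no "\n"

# First line of the second file, as readlines() would return it.
line = "वनस्पति, पेड़-पौधा\n"

# A word that still carries "\n" only matches when it sits at the very end of the line.
print([w in line for w in with_newlines])     # [True, False]
print([w in line for w in without_newlines])  # [True, True]
```

Keep in mind that split() breaks on any whitespace, so it is only appropriate when no search term contains a space.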

Add this code and you will see why that happens: it is because of the spaces. In file 1 the first word is पौधा[space]....

for i in hypernyms:
    print "file1",i

for i in words:
    print "file2",i

Put it after count_arr = [] and before for counter, line...
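A plain print makes trailing spaces and newlines hard to spot, so it can help to show only the hidden tail of each line through repr(). The following is a small Python 3 sketch; the helper name trailing_whitespace_report and the sample list (with a trailing space on the first word) are hypothetical.

```python
def trailing_whitespace_report(lines):
    """Return (index, repr of the trailing whitespace) for each line that has any."""
    report = []
    for index, line in enumerate(lines):
        stripped = line.rstrip()
        if stripped != line:
            # Keep only the invisible tail and make it visible with repr().
            report.append((index, repr(line[len(stripped):])))
    return report


# Hypothetical word list: the first entry ends in a space and a newline.
words = ["पौधा \n", "वनस्पति"]

for index, tail in trailing_whitespace_report(words):
    print(index, tail)  # 0 ' \n'
```

Any entry that shows up in the report will not match unless the line being searched has the same trailing characters in the same place.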

