How to open an HTML file?

I have an HTML file called test.html that contains a single word: בדיקה.

I open test.html and print its contents using this block of code:

file = open("test.html", "r")
print file.read()

But it prints ??????. Why does this happen, and how can I fix it?

BTW, when I open a plain text file it works fine.

Edit: I tried this:

>>> import codecs
>>> f = codecs.open("test.html", 'r')
>>> print f.read()


import codecs
f = codecs.open("test.html", 'r')

Try something like this.

You can read an HTML page using 'urllib'.

# Python 2.x
import urllib

page = urllib.urlopen("your path ").read()
print page

You can make use of the following code:

from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup

f = codecs.open("test.html", 'r', 'utf-8')
document = BeautifulSoup(f.read()).get_text()
print document

If you also want to drop the blank lines and collect all the words into a single string (skipping special characters and numbers), then also include:

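import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumption: "stop" is NLTK's English stop-word list
stop = stopwords.words('english')
docwords = word_tokenize(document)
st = ""  # define st as an empty string initially
for line in docwords:
    line = line.rstrip()
    if line:
        # keep only purely alphabetic (ASCII) tokens
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print st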
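As a rough alternative sketch (not the answer's method), the same idea can be tried in Python 3 with only the standard library: html.parser collects the text between tags, and str.isalpha() keeps alphabetic words in any script, so a Hebrew word such as בדיקה is not filtered out the way the ASCII-only regex would filter it. The names TextCollector and extract_words are made up for this example, and stop-word removal is left out.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by HTMLParser for every run of text outside of tags
        self.chunks.append(data)


def extract_words(html):
    """Return alphabetic words longer than one character."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # split() drops blank lines and extra whitespace;
    # isalpha() rejects numbers and tokens with punctuation
    return [w for w in text.split() if w.isalpha() and len(w) > 1]


sample = "<html><body><p>בדיקה test 123</p>\n\n<p>a ok!</p></body></html>"
words = extract_words(sample)
st = " ".join(words)
```

Here st ends up holding the surviving words separated by single spaces.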

Use codecs.open with the encoding parameter.

import codecs
f = codecs.open("test.html", 'r', 'utf-8')
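A minimal round-trip sketch, assuming Python 3 and that the file really is saved as UTF-8: it writes the word to a temporary test.html with codecs.open and reads it back with the same encoding. The temporary directory is only there to keep the example self-contained.

```python
import codecs
import os
import tempfile

word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # the Hebrew word from the question
path = os.path.join(tempfile.mkdtemp(), "test.html")

# Write the file as UTF-8 so the bytes on disk are known
with codecs.open(path, "w", "utf-8") as f:
    f.write(word)

# Read it back, telling codecs.open which encoding to decode with
with codecs.open(path, "r", "utf-8") as f:
    content = f.read()

# Reading in binary mode shows the raw, undecoded bytes
with open(path, "rb") as f:
    raw = f.read()
```

Each of the five Hebrew letters takes two bytes in UTF-8, which is why raw is longer than content.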

I encountered this problem today as well. I am using Windows, and the default system language is Chinese, so others with a similar setup may run into the same Unicode error. Simply add encoding='utf-8':

with open("test.html", "r", encoding='utf-8') as f:
    text = f.read()
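One plausible source of the question marks, sketched for Python 3: when text is converted to an encoding that cannot represent its characters and errors are replaced rather than raised, every such character becomes "?". The snippet also looks up the platform default that open() falls back on when no encoding is given. Whether this is exactly what happened in the asker's setup is an assumption.

```python
import locale

word = "\u05d1\u05d3\u05d9\u05e7\u05d4"  # the Hebrew word from the question

# Encoding that open() uses when no encoding argument is passed
# (platform dependent, so no particular value is assumed here)
default_encoding = locale.getpreferredencoding(False)

# ASCII cannot represent Hebrew letters; with errors="replace",
# each unencodable character is turned into "?"
replaced = word.encode("ascii", errors="replace")

# UTF-8 can represent them, so the round trip is lossless
round_trip = word.encode("utf-8").decode("utf-8")
```

Printing default_encoding on the affected machine is a quick way to check whether it differs from the encoding the file was saved in.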


import codecs
f = codecs.open("test.html", 'r', 'utf-8')
print(f.read())


You can use 'urllib' in Python 3 as well, with a few changes.


import urllib.request

# urlopen() expects a full URL with a scheme, such as file:// or http://
page = urllib.request.urlopen("/path/").read()
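A small sketch of how this might look for a local file in Python 3, assuming the file is UTF-8: urlopen() needs a URL with a scheme, so the path is turned into a file:// URL with pathlib, and the bytes returned by read() are decoded explicitly. The temporary file stands in for test.html.

```python
import pathlib
import tempfile
import urllib.request

# Create a small UTF-8 HTML file to read (stand-in for test.html)
tmp_dir = pathlib.Path(tempfile.mkdtemp())
html_path = tmp_dir / "test.html"
html_path.write_text("<p>\u05d1\u05d3\u05d9\u05e7\u05d4</p>", encoding="utf-8")

# as_uri() turns an absolute path into a file:// URL
with urllib.request.urlopen(html_path.as_uri()) as response:
    raw = response.read()  # bytes, not str

# Decode explicitly; the encoding is an assumption about the file
page = raw.decode("utf-8")
```

For a page fetched over HTTP, the right encoding would normally come from the response headers or the page itself rather than being hard-coded.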

