How do I open an HTML file?
I have an HTML file called test.html that contains a single word: בדיקה.
I open test.html and print its contents using this block of code:

    file = open("test.html", "r")
    print file.read()

But it prints ??????. Why does this happen, and how can I fix it?
By the way, when I open a plain text file it works fine.
Edit: I tried this:

    >>> import codecs
    >>> f = codecs.open("test.html", 'r')
    >>> print f.read()
    ?????
Try something like this:

    import codecs
    f = codecs.open("test.html", 'r')
    print f.read()
You can read an HTML page using urllib:

    # Python 2.x
    import urllib
    page = urllib.urlopen("your path ").read()
    print page
You can use the following code:

    from __future__ import division, unicode_literals
    import codecs
    from bs4 import BeautifulSoup

    f = codecs.open("test.html", 'r', 'utf-8')
    document = BeautifulSoup(f.read()).get_text()
    print document
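If BeautifulSoup is not installed, the same idea (pull the text out of the markup) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the BeautifulSoup approach above, and the names TextExtractor and html_to_text are hypothetical helpers made up for this illustration (Python 3):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text outside of tags; skip whitespace-only runs.
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


sample = "<html><body><h1>Title</h1><p>בדיקה</p></body></html>"
print(html_to_text(sample))
```

Unlike get_text(), this sketch joins the pieces with single spaces, and it makes no attempt to skip the contents of script or style elements.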
If you also want to drop the blank lines and collect the words into a single string (skipping special characters and numbers), add:
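
    import re
    import nltk
    from nltk.tokenize import word_tokenize

    st = ""  # accumulates the kept words
    docwords = word_tokenize(document)
    for line in docwords:
        line = line.rstrip()
        if line:
            # keep purely alphabetic tokens longer than one character
            if re.match("^[A-Za-z]*$", line):
                if line not in stop and len(line) > 1:
                    st = st + " " + line
    print st

*st starts as an empty string (st = ""). stop is not defined in this snippet; it is assumed to be a collection of stop words that you have defined yourself.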
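For readers without nltk, here is a rough Python 3 sketch of the same filtering step. It swaps word_tokenize for plain str.split, so punctuation stays attached to words (and such tokens are dropped rather than split). The stop set and the function name keep_words are hypothetical, chosen only for this example:

```python
import re

# Hypothetical stop-word set, for illustration only.
stop = {"the", "and", "is", "a"}


def keep_words(text, stop_words):
    """Keep alphabetic tokens longer than one character that are not stop words."""
    kept = []
    for token in text.split():  # split() also swallows blank lines
        if re.fullmatch(r"[A-Za-z]+", token) and len(token) > 1 and token not in stop_words:
            kept.append(token)
    return " ".join(kept)


sample = "the quick  brown fox 42 and a dog!\n\njumps"
print(keep_words(sample, stop))
```

Because the pattern only accepts A-Z and a-z, a word such as בדיקה would be filtered out as well, just as in the snippet above.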
Use codecs.open with the encoding parameter:

    import codecs
    f = codecs.open("test.html", 'r', 'utf-8')
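A minimal, self-contained sketch of what the encoding argument changes (written for Python 3; the temporary file stands in for test.html and is an assumption of this example). It also shows one way question marks can arise: when decoded text is forced into a character set that has no Hebrew letters, each letter is replaced by "?". Whether that is what happened on the asker's machine is not certain.

```python
import codecs
import os
import tempfile

word = "בדיקה"  # the sample word from the question

# Write a throwaway UTF-8 file so the sketch does not depend on an existing one.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with codecs.open(path, "w", "utf-8") as f:
    f.write(word)

# With an encoding, read() returns decoded text rather than raw bytes.
with codecs.open(path, "r", "utf-8") as f:
    text = f.read()

# Encoding into ASCII with errors="replace" turns every Hebrew letter into "?".
replaced = text.encode("ascii", errors="replace")
print(text == word, replaced)
```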
I ran into this problem today as well. I am using Windows, and the default system language is Chinese, so others with a similar setup may hit the same Unicode error. Simply add encoding='utf-8':

    with open("test.html", "r", encoding='utf-8') as f:
        text = f.read()
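If the file might not be UTF-8 after all, one option is to try a short list of candidate encodings. The sketch below is only an illustration: read_text_with_fallback is a hypothetical helper, and cp1255 (the Hebrew Windows code page) is an assumed second candidate, not something stated in the question.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "cp1255")):
    """Hypothetical helper: try each encoding in turn until one decodes."""
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding) as f:
                return f.read(), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo: a file saved in the Hebrew Windows code page instead of UTF-8.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "w", encoding="cp1255") as f:
    f.write("בדיקה")

text, used = read_text_with_fallback(path)
print(used)
```

Order matters here: single-byte code pages accept almost any byte sequence, so the stricter UTF-8 attempt should come first.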

    import codecs
    path = "D:\\Users\\html\\abc.html"
    file = codecs.open(path, "rb")
    file1 = file.read()
    file1 = str(file1)
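One caveat if this runs under Python 3: a file opened in "rb" mode returns bytes, and calling str() on bytes produces their printable representation rather than decoded text. A small sketch of the difference, assuming the content is UTF-8:

```python
# Bytes as they would come back from a file opened in binary mode.
data = "בדיקה".encode("utf-8")

as_repr = str(data)             # the repr of the bytes object, starting with b'
as_text = data.decode("utf-8")  # the actual text

print(as_repr)
print(as_text == "בדיקה")
```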
You can use urllib in Python 3 in the same way as in https://stackoverflow.com/a/27243244/4815313, with a few changes:

    # Python 3
    import urllib.request
    page = urllib.request.urlopen("/path/").read()
    print(page)
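Note that urlopen expects a URL, so a local file generally has to be passed as a file:// URL, and read() returns bytes that still need decoding. A hedged sketch for Python 3, using a temporary file as a stand-in for the real page and assuming it is UTF-8:

```python
import pathlib
import tempfile
import urllib.request

# Throwaway file standing in for the real HTML page.
path = pathlib.Path(tempfile.mkdtemp()) / "test.html"
path.write_text("בדיקה", encoding="utf-8")

# as_uri() turns an absolute path into a file:// URL that urlopen accepts.
with urllib.request.urlopen(path.as_uri()) as response:
    page = response.read().decode("utf-8")  # bytes to text

print(page == "בדיקה")
```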