Why does Python say this Netscape cookie file isn't valid?

I'm writing a Google Scholar parser, and based on this answer, I'm setting cookies before grabbing the HTML. This is the contents of my cookies.txt file:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

and this is the code I'm using to grab the HTML:

import http.cookiejar
import urllib.request, urllib.parse, urllib.error

def get_page(url, headers="", params=""):
    filename = "cookies.txt"
    request = urllib.request.Request(url, None, headers, params)
    cookies = http.cookiejar.MozillaCookieJar(filename, None, None)
    cookies.load()
    cookie_handler = urllib.request.HTTPCookieProcessor(cookies)
    redirect_handler = urllib.request.HTTPRedirectHandler()
    opener = urllib.request.build_opener(redirect_handler,cookie_handler)
    response = opener.open(request)
    return response

start = 0
search = "Ricardo Altamirano"
results_per_fetch = 20
host = "http://scholar.google.com"
base_url = "/scholar"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; U; ru; rv:5.0.1.6) Gecko/20110501 Firefox/5.0.1 Firefox/5.0.1'}
params = urllib.parse.urlencode({'start' : start,
                                 'q': '"' + search + '"',
                                 'btnG' : "",
                                 'hl' : 'en',
                                 'num': results_per_fetch,
                                 'as_sdt' : '1,14'})

url = base_url + "?" + params
resp = get_page(host + url, headers, params)

The full traceback is:

Traceback (most recent call last):
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 29, in <module>
    resp = get_page(host + url, headers, params)
  File "C:/Users/ricardo/Desktop/Google-Scholar/BibTex/test.py", line 8, in get_page
    cookies.load()
  File "C:\Python32\lib\http\cookiejar.py", line 1767, in load
    self._really_load(f, filename, ignore_discard, ignore_expires)
  File "C:\Python32\lib\http\cookiejar.py", line 1997, in _really_load
    filename)
http.cookiejar.LoadError: 'cookies.txt' does not look like a Netscape format cookies file

I've looked around for documentation on the Netscape cookie file format, but I can't find anything that shows me the problem. Are there newlines that need to be included? I changed the line endings to Unix style, just in case, but that didn't solve the problem. The closest specification I can find is this, which doesn't indicate anything to me that I'm missing. The fields on each of the last four lines are separated by tabs, not spaces, and everything else looks correct to me.

Answers


I see nothing in your example code or copy of the cookies.txt file that is obviously wrong.

I've checked the source code for the MozillaCookieJar._really_load method, which throws the exception that you see.

The first thing this method does is read the first line of the file you specified (using f.readline()) and use re.search to look for the regular expression pattern "#( Netscape)? HTTP Cookie File". This is the check that fails for your file.

It certainly looks like your cookies.txt would match that format, so the error you see is quite surprising.

Note that your file is opened with a simple open(filename) call earlier on, so it'll be opened in text mode with universal line ending support, meaning it doesn't matter that you are running this on Windows. The code will see \n newline terminated strings, regardless of what newline convention was used in the file itself.
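To illustrate the point about line endings, here is a minimal sketch. The helper name first_line_text_mode is hypothetical, and it assumes the default text-mode behaviour of open() in Python 3 (newline=None), which translates \r\n and \r to \n on read:

```python
import os
import tempfile


def first_line_text_mode(raw_bytes):
    """Write raw_bytes to a temporary file and return its first line read in text mode."""
    fd, path = tempfile.mkstemp()
    try:
        # Write the bytes exactly as given, so the line endings are preserved on disk.
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # Plain open(path) is text mode with universal newline translation.
        with open(path) as f:
            return f.readline()
    finally:
        os.remove(path)


# Windows-style and Unix-style files give the same first line.
print(repr(first_line_text_mode(b"# Netscape HTTP Cookie File\r\n\r\n")))
print(repr(first_line_text_mode(b"# Netscape HTTP Cookie File\n\n")))
```

Both calls should show the same string ending in \n, which is why converting the file to Unix line endings made no difference.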

What I'd do in this case is triple-check that your file's first line is really correct. It needs to contain either "# HTTP Cookie File" or "# Netscape HTTP Cookie File" (spaces only, no tabs, between the words, with matching capitalisation). Test this at the Python prompt:

>>> f = open('cookies.txt')
>>> line = f.readline()
>>> line
'# Netscape HTTP Cookie File\n'
>>> import re
>>> re.search("#( Netscape)? HTTP Cookie File", line)
<_sre.SRE_Match object at 0x10fecfdc8>

Python echoed the line representation back to me when I typed line at the prompt, including the \n newline character. Any surprises like tab characters or Unicode zero-width spaces will show up there as escape codes. I also verified that the regular expression used by the cookiejar code matches.
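The same check can be scripted. This is only a sketch: check_cookie_header is a hypothetical helper, the pattern is copied from the one quoted above rather than imported from http.cookiejar, and decoding as UTF-8 is an assumption about the file. Reading in binary mode keeps any stray bytes (a byte order mark, tabs, NUL bytes from a UTF-16 file) visible in the returned value:

```python
import re

# Pattern quoted above; copied here instead of reaching into http.cookiejar.
MAGIC_RE = re.compile(r"#( Netscape)? HTTP Cookie File")


def check_cookie_header(path):
    """Return (matches, raw_first_line) for the cookie file at path."""
    # Binary mode: nothing is decoded or translated before we look at it.
    with open(path, "rb") as f:
        raw = f.readline()
    # Lenient decode (assumed UTF-8): bad bytes become replacement characters
    # instead of raising, so the regex test can still run.
    text = raw.decode("utf-8", errors="replace")
    return MAGIC_RE.search(text) is not None, raw
```

Calling check_cookie_header('cookies.txt') and printing the second element of the result shows the exact bytes of the first line, which makes an unexpected encoding or hidden character easy to spot.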

You can also use pdb, the Python debugger, to verify what the http.cookiejar module really does:

>>> import pdb
>>> import http.cookiejar
>>> jar = http.cookiejar.MozillaCookieJar('cookies.txt')
>>> pdb.run('jar.load()')
> <string>(1)<module>()
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1759)load()
-> def load(self, filename=None, ignore_discard=False, ignore_expires=False):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1761)load()
-> if filename is None:
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1762)load()
-> if self.filename is not None: filename = self.filename
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1765)load()
-> f = open(filename)
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1766)load()
-> try:
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1767)load()
-> self._really_load(f, filename, ignore_discard, ignore_expires)
(Pdb) s
--Call--
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1989)_really_load()
-> def _really_load(self, f, filename, ignore_discard, ignore_expires):
(Pdb) s
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1990)_really_load()
-> now = time.time()
(Pdb) n
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1992)_really_load()
-> magic = f.readline()
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1993)_really_load()
-> if not self.magic_re.search(magic):
(Pdb) 
> /opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/http/cookiejar.py(1999)_really_load()
-> try:

In the above sample pdb session I used a combination of the step and next commands to verify that the regular expression test (self.magic_re.search(magic)) actually passed.
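If stepping through pdb is inconvenient, a non-interactive check is to call load() and catch http.cookiejar.LoadError. The sketch below uses a hypothetical helper name, try_load, and passes ignore_discard and ignore_expires so that session or expired cookies do not hide an otherwise valid file:

```python
import http.cookiejar


def try_load(path):
    """Return (jar, None) if the file loads, or (None, error_message) if it does not."""
    jar = http.cookiejar.MozillaCookieJar(path)
    try:
        # Keep session and expired cookies so only format problems cause a failure.
        jar.load(ignore_discard=True, ignore_expires=True)
    except http.cookiejar.LoadError as exc:
        return None, str(exc)
    return jar, None


if __name__ == "__main__":
    jar, error = try_load("cookies.txt")  # file name taken from the question
    if error is not None:
        print("load failed:", error)
    else:
        for cookie in jar:
            print(cookie.domain, cookie.name, cookie.value)
```

A failure on the header check produces the same "does not look like a Netscape format cookies file" message as in the traceback, so this only confirms the symptom; inspecting the first line is still needed to find the cause.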


In my case, two modifications were needed to get MozillaCookieJar to load the file (the module lives under /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/):

  1. The magic header

    You can either remove the check or add the magic header to the cookie file, which is what I prefer:

    # Netscape HTTP Cookie File

  2. The newer file format seems to allow the expires field to be omitted

    # Accept both the 7-field layout and the 6-field layout without expires.
    vals = line.split("\t")
    if len(vals) == 7:
        domain, domain_specified, path, secure, expires, name, value = vals
    elif len(vals) == 6:
        domain, domain_specified, path, secure, name, value = vals
        expires = None
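For experimenting without patching the standard library, the same idea can be sketched as a standalone function. The name parse_cookie_line and the dictionary it returns are hypothetical; this is not how the library itself stores cookies, only an illustration of the 6-or-7-field split:

```python
def parse_cookie_line(line):
    """Split one tab-separated cookie line, tolerating a missing expires field."""
    vals = line.rstrip("\r\n").split("\t")
    if len(vals) == 7:
        domain, domain_specified, path, secure, expires, name, value = vals
    elif len(vals) == 6:
        # Layout without the expires column.
        domain, domain_specified, path, secure, name, value = vals
        expires = None
    else:
        raise ValueError("expected 6 or 7 tab-separated fields, got %d" % len(vals))
    return {
        "domain": domain,
        "domain_specified": domain_specified == "TRUE",
        "path": path,
        "secure": secure == "TRUE",
        "expires": expires,
        "name": name,
        "value": value,
    }
```

Raising on any other field count, rather than silently skipping the line, makes it obvious when a file uses spaces instead of tabs.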
    

Lastly, I really hope the implementation will be updated to handle these changes.

