python - Why is this lxml.etree.HTMLPullParser leaking memory?

January 15, 2012

I'm trying to use lxml's HTMLPullParser on Linux Mint, but I'm finding that memory usage keeps increasing and I'm not sure why. Here's my test code:

    # -*- coding: utf-8 -*-
    from __future__ import division, absolute_import, print_function, unicode_literals
    import lxml.etree
    import resource
    from io import DEFAULT_BUFFER_SIZE

    for _ in xrange(1000):
        with open('stackoverflow.html', 'r') as f:
            parser = lxml.etree.HTMLPullParser()
            while True:
                buf = f.read(DEFAULT_BUFFER_SIZE)
                if not buf: break
                parser.feed(buf)
            parser.close()

        # Print memory usage
        print((resource.getrusage(resource.RUSAGE_SELF)[2] * resource.getpagesize())/1000000.0)

stackoverflow.html is the homepage of Stack Overflow, which I've saved in the same folder as the Python script. I've tried adding explicit deletes and clears, but so far nothing has worked. What am I doing wrong?
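A side note on the measurement itself: index 2 of the getrusage() result is ru_maxrss, the peak resident set size. As far as I can tell from the Linux getrusage(2) man page it is reported in kilobytes there (and in bytes on macOS), so multiplying by the page size inflates the absolute figures, although the upward trend is still real. A minimal sketch of a helper, where peak_rss_mb is a hypothetical name and the unit handling is an assumption based on those platform conventions:

```python
import resource
import sys


def peak_rss_mb():
    """Return this process's peak resident set size in megabytes (10**6 bytes)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Assumption: ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1000000.0


print("peak RSS: %.1f MB" % peak_rss_mb())
```

Because ru_maxrss is a high-water mark, this number can only stay flat or grow. A flat line across iterations suggests memory is being reused; a steady climb, as in the question, suggests it is not.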

The elements constructed by the parsers are leaking, and I can't see an API contract violation in your code that's causing it. Since the objects survive a manual garbage collection run with gc.collect(), your best bet is to try a different parsing strategy as a workaround.

To see the root cause, I used the memory exploration module objgraph, and installed xdot to view the graphs it created.

Before running your code, I ran:

    In [3]: import objgraph

    In [4]: objgraph.show_growth()

After running your code, I ran:

    In [6]: objgraph.show_growth()
    tuple                  1616      +147
    _Element                146      +146
    list                   1100       +24
    wrapper_descriptor     1423       +15
    weakref                1155        +6
    getset_descriptor       677        +4
    dict                   2777        +4
    member_descriptor       315        +3
    method_descriptor       891        +2
    _TempStore                2        +1

    In [7]: import random

    In [8]: objgraph.show_chain(
       ...: objgraph.find_backref_chain(
       ...: random.choice(objgraph.by_type('_Element')), objgraph.is_proper_module))
    Graph written to /tmp/objgraph-bfuwa9.dot (8 nodes)
    Spawning graph viewer (xdot)

Note: the numbers might be different from what you see, depending on the webpage you viewed.
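If you don't have objgraph installed, a per-type growth table like the one above can be approximated with gc.get_objects(). This is a rough sketch of the idea, not objgraph's actual implementation; type_counts, growth and FakeElement are hypothetical names, and FakeElement is a pure Python stand-in for the leaked elements.

```python
import gc
from collections import Counter


def type_counts():
    """Count gc-tracked objects by type name, after forcing a collection."""
    gc.collect()
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type name, current count, increase) rows, largest increase first."""
    rows = [(name, after[name], after[name] - before[name])
            for name in after if after[name] > before[name]]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:limit]


class FakeElement(object):
    """Hypothetical stand-in for a leaked element; not an lxml type."""

    def __init__(self):
        self.children = []


before = type_counts()
kept = [FakeElement() for _ in range(146)]  # simulate objects that never get freed
after = type_counts()

for name, total, delta in growth(before, after):
    print("%-24s %8d %+9d" % (name, total, delta))
```

Take one snapshot before the suspect code and one after it. Types whose counts keep climbing across repeated runs are the ones to chase with a back-reference graph.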
