python - Why is this lxml.etree.HTMLPullParser leaking memory?
I'm trying to use lxml's HTMLPullParser on Linux Mint, but I'm finding that memory usage keeps increasing, and I'm not sure why. Here's my test code:

    # -*- coding: utf-8 -*-
    from __future__ import division, absolute_import, print_function, unicode_literals
    import lxml.etree
    import resource
    from io import DEFAULT_BUFFER_SIZE

    for _ in xrange(1000):
        with open('stackoverflow.html', 'r') as f:
            parser = lxml.etree.HTMLPullParser()
            while True:
                buf = f.read(DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                parser.feed(buf)
            parser.close()
        # print memory usage
        print((resource.getrusage(resource.RUSAGE_SELF)[2] * resource.getpagesize())/1000000.0)

stackoverflow.html is the homepage of Stack Overflow, which I've saved in the same folder as the Python script. I've tried adding explicit deletes and clears, but so far nothing has worked. What am I doing wrong?
The elements constructed by the parsers are leaking, and I can't see an API contract violation in your code that's causing it. Since the objects survive a manual garbage collection run with gc.collect(), your best bet is to try a different parsing strategy as a workaround.
To see the root cause, I used the memory exploration module objgraph, and installed xdot to view the graphs it created.
Before running your code, I ran:

    In [3]: import objgraph
    In [4]: objgraph.show_growth()

After running your code, I ran:

    In [6]: objgraph.show_growth()
    tuple                 1616      +147
    _Element               146      +146
    list                  1100       +24
    wrapper_descriptor    1423       +15
    weakref               1155        +6
    getset_descriptor      677        +4
    dict                  2777        +4
    member_descriptor      315        +3
    method_descriptor      891        +2
    _TempStore               2        +1
    In [7]: import random
    In [8]: objgraph.show_chain(
       ...:     objgraph.find_backref_chain(
       ...:         random.choice(objgraph.by_type('_Element')),
       ...:         objgraph.is_proper_module))
    Graph written to /tmp/objgraph-bfuwa9.dot (8 nodes)
    Spawning graph viewer (xdot)

Note: the numbers might be different from what you see, depending on the webpage viewed.
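If objgraph or xdot is not available, a single hop of the back-reference chain can be inspected with the standard library alone. The sketch below is only a rough approximation of what the graph shows, under the assumption that the leaked objects are tracked by the garbage collector; `Node`, `holder` and `describe_referrers` are hypothetical names, and `Node` stands in for an `_Element`.

```python
import gc


class Node(object):
    """Hypothetical stand-in for a leaked object such as an lxml element."""


def describe_referrers(obj):
    """Return the type names of the objects that refer directly to obj."""
    return sorted(type(r).__name__ for r in gc.get_referrers(obj))


holder = [Node()]  # the list is what keeps the Node alive
node_referrers = gc.get_referrers(holder[0])
print(describe_referrers(holder[0]))
```

Repeating the call on each referrer walks further back, which is roughly the path that `objgraph.find_backref_chain` follows until it reaches a module.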