python - Why is this lxml.etree.HTMLPullParser leaking memory?
I'm trying to use lxml's HTMLPullParser on Linux Mint, but I'm finding that memory usage keeps increasing, and I'm not sure why. Here's my test code:
    # -*- coding: utf-8 -*-
    from __future__ import division, absolute_import, print_function, unicode_literals

    import lxml.etree
    import resource
    from io import DEFAULT_BUFFER_SIZE

    for _ in xrange(1000):
        with open('stackoverflow.html', 'r') as f:
            parser = lxml.etree.HTMLPullParser()
            # Feed the file to the parser one buffer at a time
            while True:
                buf = f.read(DEFAULT_BUFFER_SIZE)
                if not buf:
                    break
                parser.feed(buf)
            parser.close()
        # Print memory usage
        print((resource.getrusage(resource.RUSAGE_SELF)[2] * resource.getpagesize()) / 1000000.0)
stackoverflow.html is the homepage of Stack Overflow, which I've saved in the same folder as the Python script. I've tried adding explicit deletes and clears, but so far nothing has worked. What am I doing wrong?
The elements constructed by the parsers are leaking, and I can't see any API contract violation in your code that's causing it. Since the objects survive a manual garbage collection run with gc.collect(), your best bet is to try a different parsing strategy as a workaround.
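To illustrate what "surviving gc.collect()" means, here is a minimal sketch that uses only the standard library. The Node class, make_cycle() and survives_collection() are hypothetical names made up for this illustration; they stand in for parsed elements and are not part of lxml. An object that is kept alive only by a reference cycle disappears after a collection, while one that is still referenced from somewhere else does not, and the second case is what a real leak looks like.

```python
import gc
import weakref


class Node(object):
    """Hypothetical stand-in for a parsed element; not part of lxml."""

    def __init__(self):
        self.parent = None
        self.children = []


def make_cycle():
    # Parent and child reference each other, so reference counting
    # alone cannot free them; only the cycle collector can.
    parent, child = Node(), Node()
    parent.children.append(child)
    child.parent = parent
    return parent


def survives_collection(make_object, keep_alive=None):
    """Return True if the object from make_object() outlives gc.collect()."""
    obj = make_object()
    if keep_alive is not None:
        keep_alive.append(obj)  # simulate a stray reference held elsewhere
    ref = weakref.ref(obj)      # a weak reference does not keep obj alive
    del obj
    gc.collect()
    return ref() is not None


stray_references = []
cycle_only = survives_collection(make_cycle)
held_elsewhere = survives_collection(make_cycle, keep_alive=stray_references)
print(cycle_only, held_elsewhere)
```

If a check like this reports that objects are still alive after a collection, plain reference cycles are not the explanation and something is still holding on to them.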
To see the root cause, I used the memory exploration module objgraph and installed xdot to view the graphs it created.
Before running your code, I ran:
    In [3]: import objgraph
    In [4]: objgraph.show_growth()
After running your code, I ran:
    In [6]: objgraph.show_growth()
    tuple                  1616      +147
    _Element                146      +146
    list                   1100       +24
    wrapper_descriptor     1423       +15
    weakref                1155        +6
    getset_descriptor       677        +4
    dict                   2777        +4
    member_descriptor       315        +3
    method_descriptor       891        +2
    _TempStore                2        +1

    In [7]: import random
    In [8]: objgraph.show_chain(
       ...:     objgraph.find_backref_chain(
       ...:         random.choice(objgraph.by_type('_Element')), objgraph.is_proper_module))
    Graph written to /tmp/objgraph-bfuwa9.dot (8 nodes)
    Spawning graph viewer (xdot)
Note: the numbers might be different from what you see, depending on the webpage you saved.
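If objgraph is not available, a growth table of the same general shape can be approximated with the standard library. The sketch below is not objgraph's implementation; type_counts(), growth() and the Leaky class are hypothetical names used only for this demonstration. It counts the objects tracked by the garbage collector per type name before and after some work, then lists the types whose counts went up.

```python
import gc
from collections import Counter


def type_counts():
    """Count gc-tracked objects by type name after a full collection."""
    gc.collect()
    return Counter(type(o).__name__ for o in gc.get_objects())


def growth(before, after, limit=10):
    """Return (type name, new count, increase) rows, largest increase first."""
    deltas = [(name, after[name], after[name] - before[name])
              for name in after if after[name] > before[name]]
    return sorted(deltas, key=lambda row: row[2], reverse=True)[:limit]


class Leaky(object):
    """Hypothetical leaking type, used only for this demonstration."""


before = type_counts()
retained = [Leaky() for _ in range(100)]  # deliberately kept alive
after = type_counts()

for name, count, delta in growth(before, after):
    print("%-20s %8d %+8d" % (name, count, delta))
```

Only objects that the garbage collector tracks show up in gc.get_objects(), so the counts are a lower bound on what is actually alive. Replacing the line that builds retained with the parsing loop from the question should surface growing types in the same way.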