Python - UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'
I'm getting the error UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014'.
I'm trying to load lots of news articles into a MySQL database using MySQLdb. I'm having difficulty handling non-standard characters, and I get hundreds of these errors for all sorts of characters. I can handle them individually using .replace(), although I would prefer a more complete solution that handles them correctly.
    ubuntu@ip-10-0-0-21:~/scripts/work$ python test_db_load_error.py
    Traceback (most recent call last):
      File "test_db_load_error.py", line 27, in <module>
        cursor.execute(sql_load)
      File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 157, in execute
        query = query.encode(charset)
    UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 158: ordinal not in range(256)
My script:
    import MySQLdb as mdb
    from goose import Goose
    import string
    import datetime

    host = 'rds.amazonaws.com'
    user = 'news'
    password = 'xxxxxxx'
    db_name = 'news_reader'
    conn = mdb.connect(host, user, password, db_name)

    url = 'http://www.dailymail.co.uk/wires/ap/article-3060183/andrew-lesnie-lord-rings-cinematographer-dies.html?ito=1490&ns_mchannel=rss&ns_campaign=1490'
    g = Goose()
    article = g.extract(url=url)
    body = article.cleaned_text
    body = body.replace("'","`")
    load_date = str(datetime.datetime.now())
    summary = article.meta_description
    title = article.title
    image = article.top_image

    # The query text is built as a unicode string; MySQLdb then encodes it
    # with the connection charset (latin-1 here), which is where it fails.
    sql_load = "insert into articles " \
        " (title,summary,article,image,source,load_date) " \
        " values ('%s','%s','%s','%s','%s','%s');" % \
        (title,summary,body,image,url,load_date)
    cursor = conn.cursor()
    cursor.execute(sql_load)
    #conn.commit()
Any help appreciated.
If your database is configured for Latin-1, you cannot store non-Latin-1 characters in it. That includes U+2014, the em dash.
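The failure can be reproduced without a database at all. The following is a minimal sketch using only the standard library; the variable names are illustrative and not taken from the question's script.

```python
em_dash = u'\u2014'

# Latin-1 only covers code points 0-255; U+2014 is 8212, so it does not fit.
try:
    em_dash.encode('latin-1')
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False

# UTF-8 can represent every Unicode code point, so the same call succeeds.
utf8_bytes = em_dash.encode('utf-8')

print(fits_latin1, ord(em_dash), utf8_bytes)
```

This is the same encode step that MySQLdb performs on the query string in the traceback, just with the charset written out explicitly.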
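The ideal solution is to switch to a database configured for UTF-8. Specify UTF-8 as the character set when creating the database, and pass charset='utf8' (MySQL's name for UTF-8, spelled without the hyphen) every time you connect to it. (If you have existing data, you probably want to use MySQL tools to migrate the old database to a new one, instead of Python code, but the basic idea is the same.)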
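On the Python side, that might look roughly like the sketch below. It assumes the articles table and its columns were themselves created with a UTF-8 character set, and the helper names connect_utf8 and insert_article are hypothetical. MySQL spells the character set name utf8, without a hyphen. The sketch also passes the values as query parameters instead of formatting them into the SQL string, so the driver takes care of quoting and encoding; that is a swapped-in technique, not something the question's script does.

```python
def connect_utf8(host, user, password, db_name):
    """Open a MySQLdb connection that exchanges text with the server as UTF-8."""
    # Imported here so the rest of the sketch can be loaded without the driver.
    import MySQLdb
    return MySQLdb.connect(host=host, user=user, passwd=password, db=db_name,
                           charset='utf8', use_unicode=True)


def insert_article(conn, title, summary, body, image, source, load_date):
    """Insert one row, letting the driver quote and encode every value."""
    sql = ("INSERT INTO articles "
           "(title, summary, article, image, source, load_date) "
           "VALUES (%s, %s, %s, %s, %s, %s)")
    cursor = conn.cursor()
    # The values travel separately from the SQL text, so apostrophes and
    # characters such as the em dash need no manual .replace() calls.
    cursor.execute(sql, (title, summary, body, image, source, load_date))
    conn.commit()
```

With a connection from connect_utf8(host, user, password, db_name), the question's script would call insert_article(conn, title, summary, body, image, url, load_date) in place of building sql_load by hand.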
However, sometimes that isn't possible. Maybe you have other software that can't be updated, requires Latin-1, and needs to share the same database. Or maybe you've mixed Latin-1 text and binary data in ways that can't be programmatically unmixed, or your database is too huge to migrate, or whatever. In that case, you have two choices:
1. Destructively convert your strings to Latin-1 before storing and searching. For example, you might want to convert an em dash to '-', or to '--', or maybe it's not that important and you can convert all non-Latin-1 characters to '?' (which is faster and simpler).
2. Come up with an encoding scheme to smuggle non-Latin-1 characters into the database. This means some searches become more complicated, or can't be done directly in the database.
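Both choices can be sketched with the standard codec error handlers. The function names below are hypothetical, and the numeric character reference format (&#8212; followed by a semicolon, for the em dash) is just one possible smuggling scheme, chosen here because Python's 'xmlcharrefreplace' handler produces it directly. The sketch is written for Python 3; on Python 2.7 the chr call would be unichr.

```python
import re

EM_DASH = u'\u2014'


def to_latin1_lossy(text):
    """Choice 1: map the em dash to '-' and anything else non-Latin-1 to '?'."""
    return text.replace(EM_DASH, u'-').encode('latin-1', 'replace')


def smuggle(text):
    """Choice 2: write non-Latin-1 characters as &#NNNN; references."""
    return text.encode('latin-1', 'xmlcharrefreplace')


def unsmuggle(data):
    """Reverse smuggle() on bytes read back from the database."""
    text = data.decode('latin-1')
    # Turn each decimal reference back into the character it stands for.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)


print(to_latin1_lossy(u'a\u2014b'))
print(smuggle(u'a\u2014b'))
print(unsmuggle(smuggle(u'a\u2014b')) == u'a\u2014b')
```

Note that this particular scheme is ambiguous if the original text already contains a literal sequence that looks like a numeric reference, and that a search for an em dash has to look for its reference form instead, which is the kind of complication the second choice brings.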