html lists - Strip multiple li elements from google with VBA -
my goal strip google search results page 1 vba excel. until far managed strip first result. head, link , date stored in cells 4, 5, 6. have make loop other li's, can't straight. function stores date isn't optimal coded think. knows answer?
sub xmlhttp() dim url string, lastrow long dim xmlhttp object, html object, objresultdiv object, objh3 object, link object, objdatum object, ddatum object dim start_time date dim end_time date lastrow = range("a" & rows.count).end(xlup).row dim cookie string dim result_cookie string start_time = time debug.print "start_time:" & start_time = 2 lastrow url = "https://www.google.co.in/search?q=" & cells(i, 3) & "skipr" & "&rnd=" & worksheetfunction.randbetween(1, 10000) set xmlhttp = createobject("msxml2.serverxmlhttp") xmlhttp.open "get", url, false xmlhttp.setrequestheader "content-type", "text/xml" xmlhttp.setrequestheader "user-agent", "mozilla/5.0 (windows nt 6.1; rv:25.0) gecko/20100101 firefox/25.0" xmlhttp.send set html = createobject("htmlfile") html.body.innerhtml = xmlhttp.responsetext set objresultdiv = html.getelementbyid("rso") set objh3 = objresultdiv.getelementsbytagname("h3")(0) set link = objh3.getelementsbytagname("a")(0) set objdatum = objresultdiv.getelementsbytagname("span")(2) str_text = replace(link.innerhtml, "<em>", "") str_text = replace(str_text, "</em>", "") dat_text = objdatum.innerhtml cells(i, 4) = str_text cells(i, 5) = link.href cells(i, 6) = dat_text doevents next end_time = time debug.print "end_time:" & end_time debug.print "done" & "time taken : " & datediff("n", start_time, end_time) msgbox "done" & "time taken : " & datediff("n", start_time, end_time) end sub
you need iterate collections returned getelementsbytagname
call instead of returning first element array index (0)
i had similar project, below tips & approach reference, might assist in working & maintaining code in future:
first, instead of using createobject
prefer reference object libraries expose com objects, gives me ability browse functions , properties of each object f2
, gives me code completion (speed & less bugs) within vba editor (f7
takes code view).
giving me documentation , code completion:
also, use these const clarity
'see ready state : https://msdn.microsoft.com/en-us/library/ie/ms534361(v=vs.85).aspx const readystate_uninitialized = 0 const readystate_loading = 1 const readystate_loaded = 2 const readystate_interactive = 3 const readystate_complete = 4
finally, using domdocument60
parse xml document object model in memory.
and mshtml.htmldocument
parse html document , iterate table rows.
below code iterate returned rows table within html document embedded in initial xml document returned webserver:
dim xmldoc domdocument60 set xmldoc = getxmldocument("http://www.nbg.ge/rss.php") 'extract publication date debug.print xmldoc.getelementsbytagname("pubdate")(0).text 'unwrap html document cdata in "//item/description" element dim htmldoc new mshtml.htmldocument htmldoc.body.innerhtml = xmldoc.selectnodes("//item/description")(0).text 'extract table data html document dim tr ihtmlelement, td ihtmlelement each tr in htmldoc.getelementsbytagname("tr") each td in tr.children 'each cell in current row debug.print " " & td.innerhtml next td 'next row debug.print "-----" next tr
sample data returned webservice calling:
<rss version="2.0"> <channel> <title>rss nbg currency rates</title> <link>https://www.nbg.gov.ge/index.php?m=236&lang=geo</link> <description>currency rates</description> <language>geo</language> <copyright>copyright 2015, nbg</copyright> <pubdate>wed, 29 apr 2015 12:39:50 +0400</pubdate> <lastbuilddate>wed, 29 apr 2015 12:39:50 +0400</lastbuilddate> <managingeditor>alex@proservice.ge</managingeditor> <webmaster>alex@proservice.ge</webmaster> <item> <title>currency rates 2015-04-29</title> <link>https://www.nbg.gov.ge/index.php?m=236&lang=geo</link> <description> <![cdata[ <table border="0"> <tr> <td>aed</td> <td>10 არაბეთის გაერთიანებული საამიროების დირჰამი</td> <td>6.2858</td> <td><img src="https://www.nbg.gov.ge/images/green.gif"></td> <td>0.0640</td> </tr><tr> <td>amd</td> <td>1000 სომხური დრამი</td> <td>4.8676</td> <td><img src="https://www.nbg.gov.ge/images/green.gif"></td> <td>0.0414</td> </tr> </table> ]]> </description> <pubdate>wed, 29 apr 2015 12:39:50 +0400</pubdate> <guid> https://www.nbg.gov.ge/index.php?m=236&lang=geo&date=2015-04-29 </guid> </item> </channel> </rss>
and function gets document webserver (only works if added references shown in above pictures)
function getxmldocument(url string) msxml2.domdocument60 dim xhr new xmlhttp60 dim doc new domdocument60 dim msg string xhr .open bstrmethod:="get", bstrurl:=url, varasync:=false on error goto senderror .send on error goto 0 'http status codes - http://en.wikipedia.org/wiki/list_of_http_status_codes '200 = success - ok if .readystate = readystate_complete , .status = 200 'debug.print .responsetext doc.loadxml (.responsetext) else msg = "error" & vbnewline & "ready state: " & .readystate & _ vbnewline & "http request status: " & .status goto error end if end set getxmldocument = doc exit function senderror: 'by default access data source accross internet dissabled 'go internet options & under security>custom level>misc>access data sources accross domains> enable 'see: http://stackoverflow.com/a/17402920 msgbox "make sure access data sources accross domains enabled under internet options>security>custom", vbokonly, "could not send request server" error: msgbox msg, vbokonly, "unexpected error" end function
Comments
Post a Comment