regex - Parsing links from HTML with gawk

August 15, 2012

I'm trying to take Google's HTML and parse out the links. I use curl to obtain the HTML and then pass it to gawk. In gawk I used the match() function, and it works, but it returns only a small number of links, maybe 10 at most. If I test my regex on regex101.com, it returns 51 links using the g (global) modifier. How can I use this in gawk to obtain all the links (relative and absolute)?

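#!/bin/bash
# Fetch the page, following redirects (-L).
html=$(curl -L "http://google.com")
echo "${html}" | gawk '
  BEGIN {
    RS=" "          # split the input into records on single spaces
    IGNORECASE=1    # gawk extension: case-insensitive regex matching
  }
  {
    # match() finds only the first occurrence in each record.
    match($0, /href=\"([^\"]*)/, array);
    if (length(array[1]) > 0) {
      print array[1];
    }
  }'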

Instead of awk, you can use grep -oP:

curl -sL "http://google.com" | grep -iPo 'href="\K[^"]+'

However, this fetches only 31 links for me. The number may vary with your browser, because google.com serves a different page for different locations and signed-in users.
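
If you would rather stay with awk, one possible approach is to call match() in a loop and cut off the part of the line that has already been scanned, since match() reports only the leftmost match in the string it is given. The following is a minimal sketch, not a tested solution for Google's actual markup: the function name extract_links is made up for illustration, it assumes href values are wrapped in double quotes, and it uses only POSIX awk features (match, RSTART, RLENGTH, substr, tolower), so it should behave the same under gawk.

```shell
#!/bin/bash
# Hypothetical helper: print every double-quoted href value read from stdin.
extract_links() {
  awk '
  {
    rest = $0
    # match() reports only the leftmost match, so consume the line piece by piece.
    # tolower() makes the search case-insensitive without the gawk-only IGNORECASE.
    while (match(tolower(rest), /href="[^"]*"/)) {
      # Skip the 6 characters of the href= prefix plus opening quote,
      # and leave out the closing quote.
      url = substr(rest, RSTART + 6, RLENGTH - 7)
      if (url != "") print url
      # Continue scanning after the end of this match.
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

# Usage: curl -sL "http://google.com" | extract_links
```

Because the loop walks through each whole line, there is no need to change RS; several links on the same line are all printed, one per output line.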
