regex - Parsing links from HTML with gawk
I'm trying to take Google's HTML and parse out the links. I use curl to obtain the HTML and pass it to gawk. In gawk I used the match() function, and it works, but it returns only a small number of links, maybe 10 at most. If I test my regex on regex101.com, it returns 51 links using the g (global) modifier. How can I use this in gawk to obtain all the links (relative and absolute)?
#!/bin/bash
html=$(curl -L "http://google.com")
echo "${html}" | gawk '
BEGIN {
    RS=" "
    IGNORECASE=1
}
{
    match($0, /href=\"([^\"]*)/, array);
    if (length(array[1]) > 0) {
        print array[1];
    }
}'
Instead of awk, you can use grep -oP:
curl -sL "http://google.com" | grep -iPo 'href="\K[^"]+'
However, this fetches 31 links for me. The count may vary for you, because google.com serves a different page for different locations and for signed-in users.
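If you would rather stay with awk: match() reports only the leftmost match in each record, so any record that contains more than one href yields a single link. That is probably why the script in the question finds so few. One way around it is to loop, consuming the record after each match. The following is only a minimal sketch, assuming every href value is double-quoted. The function name extract_links and the sample HTML are made up for illustration. It uses plain awk features (RSTART, RLENGTH, tolower) rather than gawk's array argument and IGNORECASE, so it should also run under gawk.

```shell
#!/bin/bash
# extract_links: hypothetical helper name, not part of awk or gawk.
# Reads HTML on stdin and prints every double-quoted href value, one per line.
extract_links() {
  awk '{
    rest = $0
    # match() finds only the leftmost match, so keep matching on the
    # remainder of the line until nothing is left to find.
    # tolower() makes the search case-insensitive without IGNORECASE.
    while (match(tolower(rest), /href="[^"]*"/)) {
      # The match spans href="VALUE": skip the 6-character prefix
      # and drop the closing quote. Empty values are not printed.
      if (RLENGTH > 7) {
        print substr(rest, RSTART + 6, RLENGTH - 7)
      }
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

# Usage with an inline sample (made-up HTML) instead of a live page.
printf '%s\n' '<a href="/a">A</a> <A HREF="http://example.com/b">B</A>' | extract_links
```

To use it on a real page, pipe curl into it: curl -sL "http://google.com" | extract_links. Single-quoted or unquoted href values are not handled by this sketch.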