Scrape values from HTML select/option tags in R
I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it, but I'm a little unsure how to extract the bits I actually need!

I'm using the XML library to scrape the data with this code:

    library(httr)  # GET(), content()
    library(XML)   # htmlTreeParse()

    majidata_get <- GET("http://www.majidata.go.ke/town.php?mid=mte=&smid=mtm=")
    majidata_html <- htmlTreeParse(content(majidata_get, as="text"))
This leaves me with a (large) XMLDocumentContent. There is a drop-down list on the webpage, and I want to scrape its values (which relate to the names and ID numbers of different towns). The bits I want to extract are the numbers in <option value="XXX"> and the name that follows, in capital letters.

    <div class="regiondata">
      <div id="town_data">
        <select id="town" name="town" onchange="town_data(this.value);">
          <option value="0" selected="selected">[select town]</option>
          <option value="611">AHERO</option>
          <option value="635">AKALA</option>
          <option value="625">AWASI</option>
          <option value="628">AWENDO</option>
          <option value="749">BAHATI</option>
          <option value="327">BANGALE</option>
Ideally, I'd like to have these in a data.frame, with the first column being the number and the second column being the name, e.g.

    id   name
    611  AHERO
    635  AKALA
    625  AWASI

etc.
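I'm not really sure where to go from here. I had thought I could use a regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it is better and more efficient to use XPath. I'm not really sure where to start with that, though, other than thinking I need to use xpathApply somehow.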
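The XPath route mentioned above can be sketched offline with the XML package. This is only a minimal sketch under stated assumptions: `page_src` is a hand-copied fragment of the drop-down rather than the live page, and `htmlParse()` is used in place of `htmlTreeParse()` because XPath queries need an internal document (`htmlParse()` is `htmlTreeParse()` with `useInternalNodes = TRUE` as the default).

```r
library(XML)

# Assumption: a cut-down copy of the drop-down markup from the question,
# used here instead of fetching the live page.
page_src <- '<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[select town]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

# Parse into an internal document so XPath expressions can be evaluated.
doc <- htmlParse(page_src, asText = TRUE)

# Every <option> directly under <select id="town">.
xp <- "//select[@id='town']/option"

towns <- data.frame(
  id   = xpathSApply(doc, xp, xmlGetAttr, "value"),  # the value attribute
  name = xpathSApply(doc, xp, xmlValue),             # the visible text
  stringsAsFactors = FALSE
)

# Drop the "[select town]" placeholder, which carries the value "0".
towns <- towns[towns$id != "0", ]
rownames(towns) <- NULL
towns
```

Filtering on the placeholder's value rather than on its position keeps the sketch working even if the placeholder is not the first option.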
The new rvest package makes quick work of this, and lets you use sane CSS selectors, too.
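As a rough illustration of the CSS-selector approach on its own, here is a minimal offline sketch. It assumes a hand-copied fragment of the drop-down (`page_src`, a name made up for this example) rather than the live site, and it covers only the town list, not the per-town areas.

```r
library(rvest)

# Assumption: a cut-down copy of the drop-down markup from the question.
page_src <- '<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[select town]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

page <- read_html(page_src)

# "#town option" selects every <option> inside the element with id="town".
opts <- html_nodes(page, "#town option")

towns <- data.frame(
  id   = html_attr(opts, "value"),  # the value attribute
  name = html_text(opts),           # the visible text
  stringsAsFactors = FALSE
)[-1, ]                             # drop the "[select town]" placeholder row

rownames(towns) <- NULL
towns
```

The selector `#town option` does the same job as an XPath expression such as `//select[@id='town']/option`, which is what makes the rvest version shorter to read.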
Updated to incorporate the second request (see comments below).
    library(rvest)
    library(dplyr)

    # Gets the data from the second popup.
    # Returns a data frame of town_id, town_name, area_id, area_name.
    addarea <- function(town_id, town_name) {

      # make the AJAX URL and grab the data
      url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                     town_id)
      subunits <- read_html(url)

      # reformat into a data frame with the town data;
      # [-1,] drops the first <option> row
      data.frame(town_id=town_id,
                 town_name=town_name,
                 area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
                 area_name=subunits %>% html_nodes("option") %>% html_text(),
                 stringsAsFactors=FALSE)[-1,]

    }

    # get the data from the first popup and put it into a data frame
    majidata <- read_html("http://www.majidata.go.ke/town.php?mid=mte=&smid=mtm=")
    maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                       town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                       stringsAsFactors=FALSE)[-1,]

    # pass in the name and id to our addarea function and make the result into
    # a data frame with all the data (town and area)
    combined <- do.call("rbind.data.frame",
                        mapply(addarea, maji$town_id, maji$town_name,
                               SIMPLIFY=FALSE, USE.NAMES=FALSE))

    # row names aren't super-important, but let's keep them tidy
    rownames(combined) <- NULL

    str(combined)
    ## 'data.frame': 1964 obs. of  4 variables:
    ##  $ town_id  : chr  "611" "635" "625" "628" ...
    ##  $ town_name: chr  "AHERO" "AKALA" "AWASI" "AWENDO" ...
    ##  $ area_id  : chr  "60603030101" "60107050201" "60603020101" "61103040101" ...
    ##  $ area_name: chr  "AHERO" "AKALA" "AWASI" "ANINDO" ...

    head(combined)
    ##   town_id town_name     area_id area_name
    ## 1     611     AHERO 60603030101     AHERO
    ## 2     635     AKALA 60107050201     AKALA
    ## 3     625     AWASI 60603020101     AWASI
    ## 4     628    AWENDO 61103040101    ANINDO
    ## 5     628    AWENDO 61103050401      SARE
    ## 6     749    BAHATI 73101010101    BAHATI
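The combining step (one small data frame per town, stacked into a single result) can be tried without any network access. The sketch below is an illustration only: `fake_areas` and `lookup_areas()` are hypothetical stand-ins for the site and for `addarea()`, seeded with a few of the town and area names shown in the output above.

```r
# Hypothetical stand-in for the site: areas keyed by town id.
fake_areas <- list(
  "611" = c("AHERO"),
  "628" = c("ANINDO", "SARE")
)

# Hypothetical stand-in for addarea(): one row per area of the given town.
# The scalar town_id and town_name are recycled to match the areas.
lookup_areas <- function(town_id, town_name) {
  data.frame(town_id   = town_id,
             town_name = town_name,
             area_name = fake_areas[[town_id]],
             stringsAsFactors = FALSE)
}

towns <- data.frame(town_id   = c("611", "628"),
                    town_name = c("AHERO", "AWENDO"),
                    stringsAsFactors = FALSE)

# mapply() walks both columns in parallel; SIMPLIFY = FALSE keeps the result
# as a plain list of data frames instead of trying to collapse it.
pieces <- mapply(lookup_areas, towns$town_id, towns$town_name,
                 SIMPLIFY = FALSE, USE.NAMES = FALSE)

# Stack the per-town pieces into one data frame and tidy the row names.
combined <- do.call(rbind, pieces)
rownames(combined) <- NULL
combined
```

Because each piece has the same columns, `rbind` can stack them directly, and a town with several areas simply contributes several rows.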