Scrape values from HTML select/option tags in R


I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.co.ke) using R. I've managed to scrape the HTML and parse it, but I'm a little unsure how to extract the bits I actually need!

Using the XML library, I scrape the data with this code:

    library(httr)  # GET(), content()
    library(XML)   # htmlTreeParse()

    majidata_get <- GET("http://www.majidata.go.ke/town.php?mid=mte=&smid=mtm=")
    majidata_html <- htmlTreeParse(content(majidata_get, as="text"))

This leaves me with a (large) XMLDocumentContent. There is a drop-down list on the webpage, and I want to scrape its values (which relate to the names and ID numbers of different towns). The bits I want to extract are the numbers in <option value="XXX"> and the name following it in capital letters.

    <div class="regiondata">
      <div id="town_data">
        <select id="town" name="town" onchange="town_data(this.value);">
          <option value="0" selected="selected">[select town]</option>
          <option value="611">AHERO</option>
          <option value="635">AKALA</option>
          <option value="625">AWASI</option>
          <option value="628">AWENDO</option>
          <option value="749">BAHATI</option>
          <option value="327">BANGALE</option>

Ideally, I'd have these in a data.frame where the first column is the number and the second column is the name, e.g.

    id       name
    611      AHERO
    635      AKALA
    625      AWASI

etc.

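I'm not really sure where to go from here. I had thought I could use a regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it is better/more efficient to use XPath. I'm not really sure where to start with that, though, other than thinking I need to use xpathApply somehow.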
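For reference, the XPath route mentioned above might look roughly like the sketch below. It runs against a small inline copy of the markup from the question rather than the live page, so `snippet` and `towns` are illustrative names only, and it assumes the XML package is installed.

```r
library(XML)

# Assumption: a shortened inline copy of the <select> markup shown above,
# closed off so that it parses on its own.
snippet <- '<div class="regiondata"><div id="town_data">
<select id="town" name="town" onchange="town_data(this.value);">
<option value="0" selected="selected">[select town]</option>
<option value="611">AHERO</option>
<option value="635">AKALA</option>
<option value="625">AWASI</option>
</select></div></div>'

# Parse into an internal document so XPath queries can be run against it
doc <- htmlParse(snippet, asText = TRUE)

# One XPath expression selects every <option> under the town <select>;
# xmlGetAttr pulls the value attribute, xmlValue pulls the text content.
xp <- "//select[@id='town']/option"
towns <- data.frame(
  id   = xpathSApply(doc, xp, xmlGetAttr, "value"),
  name = xpathSApply(doc, xp, xmlValue),
  stringsAsFactors = FALSE
)

# Drop the placeholder entry (value "0") and tidy the row names
towns <- towns[towns$id != "0", ]
rownames(towns) <- NULL
print(towns)
```

Note that this uses htmlParse (an internal document) rather than htmlTreeParse with its defaults, since the XPath helpers in the XML package work on internal documents.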

The new rvest package makes quick work of this and lets you use sane CSS selectors, too.

Updated: this now incorporates the second request (see comments below).

    library(rvest)
    library(dplyr)

    # gets the data from the second popup
    # returns a data frame of town_id, town_name, area_id, area_name
    # (note: rvest 0.3.0 and later replace html() with read_html())
    addarea <- function(town_id, town_name) {

      # make the AJAX URL and grab the data
      url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                     town_id)
      subunits <- html(url)

      # reformat into a data frame with the town data
      data.frame(town_id=town_id,
                 town_name=town_name,
                 area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
                 area_name=subunits %>% html_nodes("option") %>% html_text(),
                 stringsAsFactors=FALSE)[-1,]

    }

    # get the data from the first popup and put it into a data frame
    majidata <- html("http://www.majidata.go.ke/town.php?mid=mte=&smid=mtm=")
    maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                       town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                       stringsAsFactors=FALSE)[-1,]

    # pass in the name and id to our addarea function and make the result
    # into a data frame with all the data (town and area) combined
    combined <- do.call("rbind.data.frame",
                        mapply(addarea, maji$town_id, maji$town_name,
                               SIMPLIFY=FALSE, USE.NAMES=FALSE))

    # row names aren't super-important, but let's keep them tidy
    rownames(combined) <- NULL

    str(combined)

    ## 'data.frame':    1964 obs. of  4 variables:
    ##  $ town_id  : chr  "611" "635" "625" "628" ...
    ##  $ town_name: chr  "AHERO" "AKALA" "AWASI" "AWENDO" ...
    ##  $ area_id  : chr  "60603030101" "60107050201" "60603020101" "61103040101" ...
    ##  $ area_name: chr  "AHERO" "AKALA" "AWASI" "ANINDO" ...

    head(combined)

    ##   town_id town_name     area_id area_name
    ## 1     611     AHERO 60603030101     AHERO
    ## 2     635     AKALA 60107050201     AKALA
    ## 3     625     AWASI 60603020101     AWASI
    ## 4     628    AWENDO 61103040101    ANINDO
    ## 5     628    AWENDO 61103050401      SARE
    ## 6     749    BAHATI 73101010101    BAHATI
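The combining step (one small data frame per town, stacked with do.call and mapply) can be tried without any network access. The sketch below is base R only and swaps the AJAX request for a hypothetical in-memory lookup, `areas_by_town`, filled with a few rows taken from the output above; `add_area_offline` is likewise an illustrative name, not part of the code above.

```r
# Hypothetical stand-in for the AJAX lookup: town id -> its areas.
# The values are copied from the head(combined) output shown above.
areas_by_town <- list(
  "611" = data.frame(area_id = "60603030101",
                     area_name = "AHERO",
                     stringsAsFactors = FALSE),
  "628" = data.frame(area_id = c("61103040101", "61103050401"),
                     area_name = c("ANINDO", "SARE"),
                     stringsAsFactors = FALSE)
)

# Same shape as the per-town function: one row per area, with the
# scalar town fields recycled across the rows.
add_area_offline <- function(town_id, town_name) {
  areas <- areas_by_town[[town_id]]
  data.frame(town_id = town_id,
             town_name = town_name,
             area_id = areas$area_id,
             area_name = areas$area_name,
             stringsAsFactors = FALSE)
}

towns <- data.frame(town_id = c("611", "628"),
                    town_name = c("AHERO", "AWENDO"),
                    stringsAsFactors = FALSE)

# mapply walks the id and name columns in parallel and returns a list of
# data frames (SIMPLIFY = FALSE); do.call(rbind, ...) stacks them.
combined <- do.call(rbind,
                    mapply(add_area_offline, towns$town_id, towns$town_name,
                           SIMPLIFY = FALSE, USE.NAMES = FALSE))
rownames(combined) <- NULL
print(combined)
```

A town with several areas simply contributes several rows, which is why the full result has more rows than there are towns.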
