R-related notes

R Scraping [movie ratings, bulletin boards, paper abstracts]

내생에달나라까지 2021. 1. 31. 09:06

Attached file: R_고려대_Scraping_강의정리.Rmd (0.02 MB)

Overview

  • Notes on a Korea University lecture series (YouTube) by Prof. 강필성 (Pilsung Kang)
    • Fall 2020, Korea University, School of Industrial & Management Engineering: Programming Language for Data Analytics (R)
    • [Korea University] Programming Language for Data Analytics (Undergraduate) KoreaUniv DSBA _ 06-1 ~ 06-4
    • 06-1: Web Scraping - Backgrounds
    • About 8 hours of lectures and you can scrape competently with R
    • If this is unfamiliar ==> invest just one day
    • Highly recommended (as far as I know, it is a first-year undergraduate course)
    • Most of this is the lecture material itself; points explained during the lecture are added as comments
  • Organized as four lectures
    • Part 1: how to handle XML structure in R; the same thing I used to implement with XML utilities in Java… [2021/1/29]
    • Part 2: an example that scrapes the contents of research papers [2021/1/30]
    • Part 3: movie ratings (IMDB)
    • Part 4: a Korean-language bulletin board
  • About the lectures
    • Chrome is the recommended browser [understand Chrome DevTools - ‘F12’]
    • Understand the rendered page [View Source]
    • An internet connection is required to follow along

Part 1: XPath with XML

  • XPath: a syntax for selecting parts of an XML document. For more information, visit w3school
    • You need to understand the structure of the document XPath operates on
    • Only by understanding the characteristics of XML nodes and selecting just the ones you need can you pull only the required parts from the web
if (!requireNamespace("XML") )
  install.packages("XML")
library("XML")

# XML/HTML parsing
obamaurl <- "http://www.obamaspeeches.com/"
obamaroot <- htmlParse(obamaurl)  # actually connects to the web
#obamaroot  

# Next we need to keep only the content we care about
# To strip out the unneeded parts we have to know the selection syntax (XPath)
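# (A hedged sketch, my addition: for example, '//a' selects every <a> node
#  anywhere in the document, and xmlGetAttr pulls a single attribute from each.)
# links <- xpathSApply(obamaroot, "//a", xmlGetAttr, "href")
# head(links)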


# Xpath example
xmlfile <- "scraping/xml_example.xml"
tmpxml <- xmlParse(xmlfile)
root <- xmlRoot(tmpxml)
root
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>29.99</price>
  </book>
  <book category="web">
    <title lang="en">XQuery Kick Start</title>
    <author>James McGovern</author>
    <author>Per Bothner</author>
    <author>Kurt Cagle</author>
    <author>James Linn</author>
    <author>Vaidyanathan Nagarajan</author>
    <year>2003</year>
    <price>49.99</price>
  </book>
  <book category="web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <year>2003</year>
    <price>39.95</price>
  </book>
</bookstore> 
# Select children node
XML::xmlChildren(root)[[1]]  # double brackets --> the children are treated as a list; [[1]] returns the first child node
<book category="cooking">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book> 
# read each call below together with its printed result
XML::xmlChildren(xmlChildren(root)[[1]])[[1]]
<title lang="en">Everyday Italian</title> 
XML::xmlChildren(xmlChildren(root)[[1]])[[2]]
<author>Giada De Laurentiis</author> 
XML::xmlChildren(xmlChildren(root)[[1]])[[3]]
<year>2005</year> 
XML::xmlChildren(xmlChildren(root)[[1]])[[4]]
<price>30.00</price> 
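# (My addition, a hedged aside: chaining xmlChildren quickly gets verbose; a
#  single XPath query via XML::getNodeSet reaches the same node directly.)
getNodeSet(root, "/bookstore/book[1]/title")[[1]]  # same node: <title lang="en">Everyday Italian</title>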
# Selecting nodes
# a path starting with '/' is absolute, i.e. evaluated from the root
XML::xpathSApply(root, "/bookstore/book[1]")
[[1]]
<book category="cooking">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book> 
XML::xpathSApply(root, "/bookstore/book[last()]")
[[1]]
<book category="web">
  <title lang="en">Learning XML</title>
  <author>Erik T. Ray</author>
  <year>2003</year>
  <price>39.95</price>
</book> 
XML::xpathSApply(root, "/bookstore/book[last()-1]")
[[1]]
<book category="web">
  <title lang="en">XQuery Kick Start</title>
  <author>James McGovern</author>
  <author>Per Bothner</author>
  <author>Kurt Cagle</author>
  <author>James Linn</author>
  <author>Vaidyanathan Nagarajan</author>
  <year>2003</year>
  <price>49.99</price>
</book> 
XML::xpathSApply(root, "/bookstore/book[position()<3]")
[[1]]
<book category="cooking">
  <title lang="en">Everyday Italian</title>
  <author>Giada De Laurentiis</author>
  <year>2005</year>
  <price>30.00</price>
</book> 

[[2]]
<book category="children">
  <title lang="en">Harry Potter</title>
  <author>J K. Rowling</author>
  <year>2005</year>
  <price>29.99</price>
</book> 
# Selecting attributes
# '//' : selects matching nodes anywhere below the root; '@' addresses an attribute
XML::xpathSApply(root, "//@category")
  category   category   category   category 
 "cooking" "children"      "web"      "web" 
XML::xpathSApply(root, "//@lang")
lang lang lang lang 
"en" "en" "en" "en" 
XML::xpathSApply(root, "//book/title", xmlGetAttr, 'lang')
[1] "en" "en" "en" "en"
# Selecting atomic values
xpathSApply(root, "//title", xmlValue)  # 모든 title의 value값값
[1] "Everyday Italian"  "Harry Potter"      "XQuery Kick Start"
[4] "Learning XML"     
xpathSApply(root, "//title[@lang='en']", xmlValue)
[1] "Everyday Italian"  "Harry Potter"      "XQuery Kick Start"
[4] "Learning XML"     
xpathSApply(root, "//book[@category='web']/price", xmlValue)
[1] "49.99" "39.95"
xpathSApply(root, "//book[price > 35]/title", xmlValue) # 35이상인 책의 title
[1] "XQuery Kick Start" "Learning XML"     
xpathSApply(root, "//book[@category = 'web' and price > 40]/price", xmlValue) # 조건2개
[1] "49.99"

Part 2: Research Papers (arXiv)

  • Reference: arXiv is a site for sharing research papers
    • Web Site : arXiv
    • We will search for “text mining” papers and scrape the title, abstract, and other details
    • Across result pages, only the trailing start=?? part of the URL changes (see the sketch right after this list)
  • Viewing the source in Chrome
    • ‘F12’ ==> developer tools
    • Right-click the spot you want on the HTML page and choose ‘Inspect’ to jump to the matching source
    • ‘Toggle device toolbar’: Ctrl + Shift + M
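A quick sketch of that URL rule (my addition; it reuses the search URL defined just below): httr::modify_url rewrites only the start query parameter and leaves the others intact.

library(httr)

base_url <- 'https://arxiv.org/search/?query=%22text+mining%22&searchtype=all&source=header&start=0'
# arXiv returns 50 results per page, so start = 0, 50, 100, ...
page_urls <- sapply(seq(0, 100, by = 50),
                    function(s) modify_url(base_url, query = list(start = s)))
page_urls  # three URLs that differ only in start=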
if (!requireNamespace("dplyr")) install.packages("dplyr")
if (!requireNamespace("stringr")) install.packages("stringr")
if (!requireNamespace("httr")) install.packages("httr")
if (!requireNamespace("rvest")) install.packages("rvest")

library(dplyr)
library(stringr)  # string-handling package
library(httr)     # web access
library(rvest)    # scraping support

#"text minig"관련 처음 URL
url <- 'https://arxiv.org/search/?query=%22text+mining%22&searchtype=all&source=header&start=0'

# to understand the URL structure
httr::parse_url(url)  # the only value that actually has to change is $query$start
$scheme
[1] "https"

$hostname
[1] "arxiv.org"

$port
NULL

$path
[1] "search/"

$query
$query$query
[1] "\"text+mining\""

$query$searchtype
[1] "all"

$query$source
[1] "header"

$query$start
[1] "0"


$params
NULL

$fragment
NULL

$username
NULL

$password
NULL

attr(,"class")
[1] "url"
start <- proc.time()
title <- NULL
author <- NULL
subject <- NULL
abstract <- NULL
meta <- NULL

#pages <- seq(from = 0, to = 430, by = 50)
pages <- seq(from = 0, to = 100, by = 50)  # run only a subset
# running the full loop (all pages, all papers) takes about 15 minutes
for( i in pages){
  tmp_url <- httr::modify_url(url, query = list(start = i)) # rewrite only the start part of the url

  #p.list-title.is-inline-block  ==> found by inspecting the page with Chrome developer tools (F12)
  # <p class="list-title is-inline-block"><a href="https://arxiv.org/abs/2101.12177">arXiv:2101.12177</a>
  #      <span>&nbsp;[<a href="https://arxiv.org/pdf/2101.12177">pdf</a>]&nbsp;</span>
  #    </p>
  #  
  
  ## collect this page's links into tmp_list
  read_html(tmp_url) %>% 
    html_nodes('p.list-title.is-inline-block') %>%    # returns the <p> shown above; the '.'s chain the classes of class="list-title is-inline-block"
    html_nodes('a[href^="https://arxiv.org/abs"]') %>%  # '[' starts an attribute selector; '^=' means "starts with"
    html_attr('href') -> tmp_list     # 50 URLs extracted
  
# visit each link's detail URL and pull the data [details: read each paper's info]
#  for(j in 1:length(tmp_list)){
  for(j in 1:5){  # run only a few
    tmp_paragraph <- read_html(tmp_list[j])
    
    # title ==> 
    #<h1 class="title mathjax"><span class="descriptor">Title:</span>Conjoined Dirichlet Process</h1>
    tmp_title <- tmp_paragraph %>% html_nodes('h1.title.mathjax') %>% html_text(T)  # TRUE trims surrounding whitespace
    tmp_title <-  gsub('Title:', '', tmp_title) #[1] "Conjoined Dirichlet Process"
    title <- c(title, tmp_title)
    
    # author
    # <div class="authors"><span class="descriptor">Authors:</span>
    #   <a href="https://arxiv.org/search/stat?searchtype=author&amp;query=Ngo%2C+M+N">Michelle N. Ngo</a>, 
    #   <a href="https://arxiv.org/search/stat?searchtype=author&amp;query=Pluta%2C+D+S">Dustin S. Pluta</a>, 
    #   <a href="https://arxiv.org/search/stat?searchtype=author&amp;query=Ngo%2C+A+N">Alexander N. Ngo</a>, 
    #   <a href="https://arxiv.org/search/stat?searchtype=author&amp;query=Shahbaba%2C+B">Babak Shahbaba</a>
    #</div>
    tmp_author <- tmp_paragraph %>% html_nodes('div.authors') %>% html_text #"Authors:Michelle N. Ngo, Dustin S. Pluta, Alexander N. Ngo, Babak Shahbaba"
    # html_text ==> extracts only the text
    tmp_author <- base::gsub('\\s+',' ',tmp_author)  # collapse runs of whitespace (tabs and newlines too) into one space
    tmp_author <- base::gsub('Authors:','',tmp_author) %>% str_trim  # str_trim: strip leading/trailing whitespace
    author <- c(author, tmp_author)  
    
    # subject
    # <td class="tablecell subjects">
    #            <span class="primary-subject">Machine Learning (stat.ML)</span>; Machine Learning (cs.LG); Methodology (stat.ME)
    #</td>
    tmp_subject <- tmp_paragraph %>% html_nodes('span.primary-subject') %>% html_text(T) #"Machine Learning (stat.ML)"
    subject <- c(subject, tmp_subject)
    
    # abstract
    tmp_abstract <- tmp_paragraph %>% html_nodes('blockquote.abstract.mathjax') %>% html_text(T)
    tmp_abstract <- gsub('\\s+',' ',tmp_abstract)  # newlines are converted to spaces as well
    tmp_abstract <- sub('Abstract:','',tmp_abstract) %>% str_trim
    abstract <- c(abstract, tmp_abstract)
    
    # meta 
    tmp_meta <- tmp_paragraph %>% html_nodes('div.submission-history') %>% html_text
    #gsub('\\s+', ' ',tmp_meta)
    gsub('\\s+', ' ',tmp_meta) %>% 
      strsplit('[v1]', fixed = T) %>% 
      lapply('[',2) %>%  # take the 2nd element of each list item ('[' is a function: M[1, 2] == `[`(M, 1, 2))
      unlist %>% 
      str_trim -> tmp_meta
#    tmp_meta <- lapply(strsplit(gsub('\\s+', ' ',tmp_meta), '[v1]', fixed = T),'[',2) %>% unlist %>% str_trim
    meta <- c(meta, tmp_meta)
#    cat(j, "paper\n")
    Sys.sleep(1)   # sleep 1 s so the server does not flag us for requesting too quickly
    
  }
#  cat((i/50) + 1,'/ 9 page\n')
  
}
papers <- data.frame(title, author, subject, abstract, meta)
end <- proc.time()
end - start # Total Elapsed Time
   user  system elapsed 
   0.98    0.05   37.73 
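The lapply('[', 2) idiom used in the meta step above is worth a standalone toy example (my addition): in R, '[' is itself a function, so it can be handed to lapply to index every element of a list.

x <- list(c("a", "b"), c("c", "d"))
lapply(x, '[', 2)   # list("b", "d"): the 2nd element of each vector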
# Export the result
save(papers, file = "Arxiv_Text_Mining.RData")
write.csv(papers, file = "scraping/Arxiv papers on Text Mining.csv")
### Be sure to open the result file in Excel and inspect it.
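One practical note of my own (not from the lecture): depending on your locale, write.csv can produce a CSV that Excel misreads, especially for non-ASCII text. readr::write_excel_csv writes UTF-8 with a byte-order mark that Excel recognizes:

# hypothetical alternative export, assuming readr is available
if (!requireNamespace("readr")) install.packages("readr")
readr::write_excel_csv(papers, "scraping/Arxiv papers on Text Mining.csv")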

Part 3: Movie Ratings (IMDB Top 50 Movies)

  • Collecting movie review data
    • Scraping movies from IMDB…
    • Movie title, year, average rating, writers, reviews, …
library(dplyr)
library(stringr)
library(httr)
library(rvest)

url <- 'https://www.imdb.com/search/title/?groups=top_250&sort=user_rating'

start <- proc.time()
imdb_top_50 <- data.frame()  # initialize
cnt <- 1   # index of the record currently being collected

#<h3 class="lister-item-header">
#        <span class="lister-item-index unbold text-primary">1.</span>
#    <a href="/title/tt0111161/?ref_=adv_li_tt">The Shawshank Redemption</a>
#    <span class="lister-item-year text-muted unbold">(1994)</span>
#</h3>
tmp_list <- read_html(url) %>% html_nodes('h3.lister-item-header') %>% 
  html_nodes('a[href^="/title"]') %>% html_attr('href')

for( i in 1:50){
  
  tmp_url <- paste('http://imdb.com', tmp_list[i], sep="") #"http://imdb.com/title/tt0111161/?ref_=adv_li_tt"
  tmp_content <- read_html(tmp_url)
  
  # Extract title and year
  #<div class="title_wrapper">
#     <h1 class="">The Shawshank Redemption&nbsp;<span id="titleYear">
#         (<a href="/year/1994/?ref_=tt_ov_inf">1994</a>)</span>            </h1>
#    <div class="subtext">15
#  ...
  title_year <- tmp_content %>% html_nodes('div.title_wrapper > h1') %>% html_text %>% str_trim
# 'div.title_wrapper > h1' ==> find "div.title_wrapper", then select its child 'h1' ('>' is the CSS child combinator)
  
  tmp_title <- substr(title_year, 1, nchar(title_year)-7)
  tmp_year <- substr(title_year, nchar(title_year)-4, nchar(title_year)-1)
  tmp_year <- as.numeric(tmp_year)
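  # (Hypothetical alternative, my addition: a regex is less brittle than fixed
  #  offsets if the "Title (YYYY)" layout ever changes.)
  # tmp_year <- as.numeric(stringr::str_match(title_year, "\\((\\d{4})\\)")[, 2])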
  
  # Average rating
#<div class="ratingValue">
#<strong title="9.3 based on 2,342,117 user ratings"><span itemprop="ratingValue">9.3</span></strong><span #class="grey">/</span><span class="grey" itemprop="bestRating">10</span>                    </div>  
  tmp_rating <- tmp_content %>% html_nodes('div.ratingValue > strong > span') %>% html_text
  tmp_rating <- as.numeric(tmp_rating)
  
  # Rating counts
  #<span class="small" itemprop="ratingCount">2,342,117</span>
  tmp_count <- tmp_content %>% html_nodes('span.small') %>% html_text
  tmp_count <- gsub(",", "", tmp_count)
  tmp_count <- as.numeric(tmp_count)
  
  # Summary
  tmp_summary <- tmp_content %>% html_nodes('div.summary_text') %>% html_text %>% str_trim
   
  # Director, Writers, and Stars
  # the three values are not separated by tags, only by their text content
  
# <div class="plot_summary ">
# ...
#     <div class="credit_summary_item">
#         <h4 class="inline">Director:</h4>
# <a href="/name/nm0001104/?ref_=tt_ov_dr">Frank Darabont</a>    </div>
#     <div class="credit_summary_item">
#         <h4 class="inline">Writers:</h4>
# <a href="/name/nm0000175/?ref_=tt_ov_wr">Stephen King</a> (short story "Rita Hayworth and Shawshank Redemption"), <a href="/name/nm0001104/?ref_=tt_ov_wr">Frank Darabont</a> (screenplay)    </div>
#     <div class="credit_summary_item">
#         <h4 class="inline">Stars:</h4>

  
  tmp_dws <- tmp_content %>% html_nodes('div.credit_summary_item') %>% html_text
  tmp_director <- tmp_dws[1] %>% str_trim
  tmp_director <- sub("Director:\n", "", tmp_director)
  
  tmp_writer <- tmp_dws[2] %>% str_trim
  tmp_writer <- sub("Writers:\n", "", tmp_writer)
  
  tmp_stars <- tmp_dws[3] %>% str_trim
  tmp_stars <- strsplit(tmp_stars, "\nSee")[[1]][1]
  tmp_stars <- sub("Stars:\n", "", tmp_stars)
  tmp_stars <- substr(tmp_stars, 1, nchar(tmp_stars)-1) %>% str_trim

# Reviews
# first work out the URL pattern of the review pages
  # Extract the first 25 reviews
  title_id <- strsplit(tmp_list[i], "/")[[1]][3]
  review_url <- paste("https://www.imdb.com/title/", title_id, "/reviews?ref_=tt_urv", sep="")
  tmp_review <- read_html(review_url) %>% html_nodes('div.review-container')
  
  
  for(j in 1:25){
#    cat("Scraping the", j, "-th review of the", i, "-th movie. \n")
    
    tryCatch({   #tryCatch({}, error = function(e){print("...")})
      
      # Review rating 
      tmp_info <- tmp_review[j] %>% html_nodes('span.rating-other-user-rating > span') %>% html_text
      tmp_review_rating <- as.numeric(tmp_info[1])  # only the leading number is needed
      
      # Review title
      tmp_review_title <- tmp_review[j] %>% html_nodes('a.title') %>% html_text
      tmp_review_title <- tmp_review_title %>% str_trim
      
      # Review text
      tmp_review_text <- tmp_review[j] %>% html_nodes('div.text.show-more__control') %>% html_text
      tmp_review_text <- gsub("\\s+", " ", tmp_review_text)
      tmp_review_text <- gsub("\"", "", tmp_review_text) %>% str_trim
      
      # Store the results
      imdb_top_50[cnt,1] <- tmp_title
      imdb_top_50[cnt,2] <- tmp_year
      imdb_top_50[cnt,3] <- tmp_rating
      imdb_top_50[cnt,4] <- tmp_count
      imdb_top_50[cnt,5] <- tmp_summary
      imdb_top_50[cnt,6] <- tmp_director
      imdb_top_50[cnt,7] <- tmp_writer
      imdb_top_50[cnt,8] <- tmp_stars
      imdb_top_50[cnt,9] <- tmp_review_rating
      imdb_top_50[cnt,10] <- tmp_review_title
      imdb_top_50[cnt,11] <- tmp_review_text
      
      cnt <- cnt+1
      }, error = function(e){print("An error occurs, skip the review")})
    }
  Sys.sleep(1) # Pretending not a bot
}
[1] "An error occurs, skip the review"
[1] "An error occurs, skip the review"
names(imdb_top_50) <- c("Title", "Year", "Avg.Rating", "RatingCounts", "Summary", "Director",
                        "Writer", "Stars", "Review.Rating", "Review.Title", "Review.Text")

end <- proc.time()
end - start # Total Elapsed Time
   user  system elapsed 
  27.21    1.29  288.26 
# Export the result
#save(imdb_top_50, file = "imdb_top_50.RData")
write.csv(imdb_top_50 , file = "scraping/imdb_top_50.csv")
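As a quick sanity check (my addition), the exported frame has one row per review; a dplyr summary collapses it back to one row per movie:

# assumes the scrape above completed; average review rating per title
imdb_top_50 %>%
  group_by(Title) %>%
  summarise(Avg.Review = mean(Review.Rating, na.rm = TRUE), N.Reviews = n()) %>%
  arrange(desc(Avg.Review)) %>%
  head()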

Part 4: Korean-Language Page (ppomppu)

  • Scraping a Korean-language page [www.ppomppu.co.kr]
    • the first 10 pages of the ‘보험포럼’ (insurance forum) ..
  • Processing steps
    • Work out the page URL structure (find the part that changes from page to page)
    • Handle the Korean text encoding… (see the sketch after this list)
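A minimal sketch of the encoding point (my addition; it assumes the board is served as EUC-KR/CP949, which is common on older Korean sites): you can declare the encoding when parsing instead of repairing it afterwards, which is what the code below does with rvest::repair_encoding.

# hypothetical alternative: tell the parser the encoding up front
page <- read_html('http://www.ppomppu.co.kr/zboard/zboard.php?id=insurance&page=1',
                  encoding = "EUC-KR")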
library(dplyr)
library(stringr)
library(httr)
library(rvest)

url <- 'http://www.ppomppu.co.kr/zboard/zboard.php?id=insurance&page='
start <- proc.time()
ppomppu_insurance <- data.frame()
Npost <- 1

# Extract the link of each post (for first 10 pages)
for( i in c(1:10)){  # Page
  
  tryCatch({
    
    tmp_url <- paste(url, i, '&divpage=13', sep="")
    # the post list is split across two row classes, list0 and list1
    tmp_list0 <- read_html(tmp_url) %>% html_nodes('tr.list0') %>% html_nodes('a') %>% html_attr('href')
    tmp_list1 <- read_html(tmp_url) %>% html_nodes('tr.list1') %>% html_nodes('a') %>% html_attr('href')
    tmp_list0 <- paste0('http://www.ppomppu.co.kr/zboard/',tmp_list0)
    tmp_list1 <- paste0('http://www.ppomppu.co.kr/zboard/',tmp_list1)
    tmp_list <- c(tmp_list0, tmp_list1)
    
    for(j in 1:length(tmp_list)){ # one post
      
#      cat("Processing ", j, "-th Post of ", i, "-th page \n", sep="")
      tryCatch({
        tmp_paragraph <- read_html(tmp_list[j])
        
        # title
#<font class="view_title2"><!--DCM_TITLE-->보험 설계 봐주세요.<!--/DCM_TITLE--></font>        
        tryCatch({
          tmp_title <- rvest::repair_encoding(tmp_paragraph %>% html_nodes('font.view_title2') %>% html_text(T))
        }, error = function(e){tmp_title <<- NULL})  # '<<-' so the NULL reaches the enclosing scope
        
        # date
        
        tryCatch({
          tmp_date <- repair_encoding(tmp_paragraph %>% html_nodes('td.han') %>% html_text(T))[2]
          date_start_idx <- gregexpr(pattern ='등록일', tmp_date)[[1]][1]
          tmp_date <- substr(tmp_date, date_start_idx+5, date_start_idx+20)
        }, error = function(e){tmp_date <<- NULL})  # '<<-' as above
        
        # contents
        tryCatch({
          tmp_contents <- repair_encoding(tmp_paragraph %>% html_nodes('td.board-contents') %>% html_text(T))
          tmp_contents <- gsub("[[:punct:]]", " ", tmp_contents)   #문장기호 !.?  
          tmp_contents <- gsub("[[:space:]]", " ", tmp_contents)   #Space는 한칸 space
          tmp_contents <- gsub("\\s+", " ", tmp_contents)          #  
          tmp_contents <- stringr::str_trim(tmp_contents, side = "both")    # 양쪽 trim
        }, error = function(e){tmp_contents <- NULL})
        
        
## also collect the replies (comments) -----
        tmp_comments <- tmp_paragraph %>% html_nodes('div.comment_wrapper')
  
        df_comment <- data.frame()
        
        for(k in seq_len(length(tmp_comments))) {  # seq_len avoids iterating 1:0 when there are no comments
#          cat("Comment", k, ": scraping the", j, "-th post of the", i, "-th page \n")
            
          temp_str <- tmp_comments[k] %>% html_nodes('div.over_hide.link-point') %>% html_text
          base::gsub("[[:space:]]", " ", temp_str) %>% 
            base::gsub("\\s+", " ", .) %>% 
            stringr::str_trim(., side = "both") -> temp_str
#          cat('         -->   ' , temp_str, '\n')  
          df_comment[k,1] <-  temp_str
        }
        
        if (nrow(df_comment) > 0) {
          for (k in seq_len(nrow(df_comment))) {
            ppomppu_insurance[Npost,1] <- tmp_title
            ppomppu_insurance[Npost,2] <- tmp_date
            ppomppu_insurance[Npost,3] <- tmp_contents
            ppomppu_insurance[Npost,4] <- df_comment[k,1]
            Npost <- Npost + 1
            
          }
          
        } else {
          ppomppu_insurance[Npost,1] <- tmp_title
          ppomppu_insurance[Npost,2] <- tmp_date
          ppomppu_insurance[Npost,3] <- tmp_contents
          Npost <- Npost + 1
        }
        
      }, error = function(e){print("Invalid conversion, skip the post")})
    }
  }, error = function(e){print("Invalid conversion, skip the page")})
}
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
[1] "Invalid conversion, skip the post"
end <- proc.time()
end - start # Total Elapsed Time
   user  system elapsed 
  18.10    1.12   73.76 
# Export the result
write.csv(ppomppu_insurance, file = "scraping/ppomppu_insurance.csv")
## Be sure to open the result file in Excel and inspect it.
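Same caveat as in Part 2, with a twist for Hangul (my addition): Korean Excel builds often assume CP949 for CSV files, so exporting with fileEncoding = "CP949" (an argument write.csv inherits from write.table) avoids garbled text:

# hypothetical alternative export for Korean Excel
write.csv(ppomppu_insurance, file = "scraping/ppomppu_insurance.csv",
          fileEncoding = "CP949")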