Posts in the 'All categories' category (Page 10)


[Crawling] Applying it to a real site / printing only the items that contain a br tag / removing br tags

2021. 6. 8. 12:51


import requests as rq
from bs4 import BeautifulSoup

res = rq.get("https://www.naver.com")
html = res.text
print(html)

soup = BeautifulSoup(html, 'lxml')

soup.find_all('p')

Converting the result to a list

list_p = list(soup.find_all('p'))
list_p

type(list_p)  # list

for i, j in enumerate(list_p):  # enumerate supplies the index i alongside each tag j
    print(i, j)

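Exercise) From the output above, print only the items that contain a br tag

strre = []

for i, j in enumerate(list_p):
    if j.find('br') is not None:  # look for a real br tag; 'br' in str(j) would also match words such as "library"
        print(i)
        strre.append(j)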

Exercise) Remove the br tags

Use replace on the tag converted to a string.

remover = []
for i in strre:
    temp = str(i)  # a br tag is serialized as <br/>
    temp = temp.replace('<br/>', ' ')
    remover.append(temp)
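
As a hedged alternative to string replacement, both exercises can be sketched directly on the parsed tree. The HTML snippet below is made up for illustration, and html.parser is used only so that the sketch does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page
html = "<p>line one<br/>line two</p><p>no break here</p>"
soup = BeautifulSoup(html, "html.parser")

# Keep only the <p> tags that really contain a <br> tag
with_br = [p for p in soup.find_all("p") if p.find("br") is not None]

# Swap every <br> for a single space, then read the text back
for p in with_br:
    for br in p.find_all("br"):
        br.replace_with(" ")

cleaned = [p.get_text() for p in with_br]
print(cleaned)
```

Working on the tree avoids depending on how the tag happens to be serialized, at the cost of modifying soup in place.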


[Crawling] Accessing elements

2021. 6. 8. 12:00


* This post was written in a Jupyter notebook.

An element covers not only tags but also the strings wrapped inside tags.

from bs4 import BeautifulSoup

html = """<html> <head><title>test site</title></head> <body> <p><a>test1</a><b>test2</b><c>test3</c></p> </body></html>"""
soup = BeautifulSoup(html,'lxml')
tag_a = soup.a
tag_a  # <a>test1</a>
tag_a_nexts = tag_a.next_siblings
tag_a_nexts  # a generator over the siblings that follow <a>

tag_a = soup.a
tag_a_nexts = tag_a.next_elements

for i in tag_a_nexts:
    print(i)

next_elements starting from the p tag

tag_p_nexts = soup.p.next_elements

print(soup.prettify())
print('**elements**')
for i in tag_p_nexts:
    print(i)

Accessing exactly the element you want

Use find_all.

Get every title tag

print(soup.find_all('title'))

Get every p tag

print(soup.find_all('p'))

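Passing id= to find_all fetches tags by their id. An id value should appear only once in a page, so searching for a specific id returns a list with one tag or an empty list. id=True matches every tag that has an id attribute, and id=False matches tags that have none.

html = """<html> <head><title>test site</title></head> <body> <p>test1</p><p id="d">test2</p><p>test3</p> </body></html>"""
soup = BeautifulSoup(html,'lxml')

print(soup.find_all(id=True))  # tags that have an id attribute

print(soup.body.find_all(id=False))  # tags under body that have no id attribute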


[Crawling] Accessing parent tags / sibling / generator

2021. 6. 8. 11:43


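* This post was written in a Jupyter notebook.

from bs4 import BeautifulSoup

# Assumed sample document: the post does not show the HTML that soup was built from
html = """<html> <head><title>test site</title></head> <body> <p><span>test1</span><span>test2</span></p> </body></html>"""
soup = BeautifulSoup(html,'lxml')

tag_span = soup.span
tag_title = soup.title
span_parent = tag_span.parent
title_parent = tag_title.parent

print(tag_span)
print(tag_title)

print(span_parent)
print(title_parent)

span_parents = tag_span.parents
title_parents = tag_title.parents

print(span_parents)  # a generator object
print(title_parents)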

parents is a generator, so you can loop over it.

for i in span_parents:
    print(i)

You can also print only the text.

for i in title_parents:
    print(i.text)

Sibling relationships

html = """<html> <head><title>test site</title></head> <body> <p><a>test1</a><b>test2</b><c>test3</c></p> </body></html>"""
soup = BeautifulSoup(html,'lxml')

tag_a = soup.a
tag_a # <a>test1</a>

tag_b = soup.b
tag_b # <b>test2</b>

tag_c = soup.c
tag_c # <c>test3</c>

tag_a_nexts = tag_a.next_siblings
tag_a_nexts # a generator over the siblings after <a>

tag_a_prevs = tag_a.previous_siblings
tag_a_prevs # <generator object PageElement.previous_siblings at 0x00000243F01AD120>

for sibling in tag_a_nexts:
    print(sibling)


[Crawling] Creating a generator / comparing with JavaScript's yield

2021. 6. 8. 11:24


* This post was written in a Jupyter notebook.

def test_generator():
    yield 1
    yield 2
    yield 3

gen = test_generator()
type(gen)

next(gen) #1
next(gen) #2

for i in test_generator():
    print(i)

def test_generator():
    print('before yield 1')
    yield 1
    print('between yield 1 and 2')
    yield 2
    print('between yield 2 and 3')
    yield 3
    print('after yield 3')

for i in test_generator():
    print(i)

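When execution reaches yield, a value is returned to the caller, but the function's state is kept. The function gives way for a moment and later resumes from the same spot.

Whenever you see yield, you can think of the function as a generator.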
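
The pause-and-resume behaviour can be made visible by recording what has run so far. This is a small sketch; traced_generator and the log list are names invented for this illustration.

```python
def traced_generator(log):
    # Each append runs only when the caller asks for the next value
    log.append("before yield 1")
    yield 1
    log.append("between yield 1 and 2")
    yield 2
    log.append("after yield 2")

log = []
gen = traced_generator(log)
created = list(log)        # nothing in the body has run yet

first = next(gen)
after_first = list(log)    # only the code up to the first yield has run

second = next(gen)
after_second = list(log)

rest = list(gen)           # runs the tail of the body, then the generator ends
print(created, after_first, after_second, log)
```

Calling the generator function does not execute any of its body. Each next() call runs the body only as far as the following yield.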

Creating an infinite generator

def infinite_generator():
    count=0
    while True:
        count+=1
        yield count
        
gen = infinite_generator()

next(gen)

Each time you run the cell again, the number goes up.
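
Looping directly over an infinite generator never finishes, so one common way to take a bounded slice is itertools.islice. The sketch below repeats the generator from above so that it runs on its own.

```python
from itertools import islice

def infinite_generator():
    count = 0
    while True:
        count += 1
        yield count

# Take only the first five values; the generator itself never ends
first_five = list(islice(infinite_generator(), 5))
print(first_five)
```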

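List, set, and dictionary comprehensions share their expression syntax with generator expressions. Written inside parentheses instead of brackets, the same expression produces a generator.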

[x *x for x in [2,4,6]]
#[4, 16, 36]

print(type(x*x for x in [2,4,6]))
#<class 'generator'>
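
A short sketch of the practical difference: the list is built immediately and can be reused, while the generator computes values on demand and can be consumed only once. The variable names are chosen for this illustration.

```python
squares_list = [x * x for x in [2, 4, 6]]   # built immediately
squares_gen = (x * x for x in [2, 4, 6])    # computed on demand

first_pass = list(squares_gen)
second_pass = list(squares_gen)  # the generator is already exhausted

print(squares_list, first_pass, second_pass)
```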

Comparing with JavaScript's yield

This part was done in Visual Studio Code.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <script>
        function* idMaker(){ // a JavaScript generator function
            var index=0;
            while(index<3)
                yield index++
        }
        var gen = idMaker();
        console.log(gen.next().value)
        console.log(gen.next().value)
        console.log(gen.next().value)
    </script>
    <title>Document</title>
</head>
<body>
    
</body>
</html>


[Crawling] Creating an iterator

2021. 6. 8. 10:55


class IterClass(object):
    def __init__(self,start,last):
        self.current = start
        self.max = last
    
    def __iter__(self): # without this method, a for loop raises "'IterClass' object is not iterable"
        return self
    
    def __next__(self):
        if self.current > self.max: # current grows by 1 on every __next__ call; once it passes max, StopIteration is raised
            raise StopIteration
        else:
            self.current += 1
            return self.current - 1 # return the value from before the increment

n_list1 = IterClass(1,10)

type(n_list1)

n_list1.__next__() #1
n_list1.__next__() #2
n_list1.__next__() #3

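Rewritten as a for loop:

n_list1 = IterClass(1,10) # start over with a fresh iterator, since the one above has already produced 1 to 3
for i in range(0,10):
    print(n_list1.__next__())

Arrays, tuples, and lists are iterable objects.

In constructs such as a for-in loop, __next__ is called automatically behind the scenes.

The loop ends when the iterator reaches its last item and raises StopIteration.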
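
To make the automatic call concrete, here is a sketch that spells out by hand roughly what a for loop does with an iterator. CountUp is a hypothetical class written for this example, mirroring the one above.

```python
class CountUp:
    """Hypothetical iterator that produces start..last inclusive."""

    def __init__(self, start, last):
        self.current = start
        self.last = last

    def __iter__(self):
        return self

    def __next__(self):
        if self.current > self.last:
            raise StopIteration
        value = self.current
        self.current += 1
        return value

# The manual equivalent of `for value in CountUp(1, 3)`
collected = []
iterator = iter(CountUp(1, 3))
while True:
    try:
        value = next(iterator)
    except StopIteration:
        break  # the for loop swallows StopIteration in the same way
    collected.append(value)

looped = [value for value in CountUp(1, 3)]
print(collected, looped)
```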


[Crawling] Difference between get and [] / using loops

2021. 6. 8. 10:51


* This post was written in a Jupyter notebook.

from bs4 import BeautifulSoup
html = """<html> <head><title class="t" id="ti">test site</title></head> <body> <p>test</p> <p>test1</p> <p>test2</p> </body></html>"""
soup = BeautifulSoup(html,'lxml')
tag_title = soup.title
print(tag_title['class'])

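tag_title.get('class') # fetch the class attribute with get

# both give the same result

If you get an import error, run the commands below as they are. In a Jupyter notebook, put ! in front of the command; in cmd, leave the ! off.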

!pip install beautifulsoup4

!pip install lxml

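tag_title.attrs # all attributes as a dict

The difference between the two is:

tag_title.get('class1') # get does not raise for a missing attribute; it returns None

tag_title['class1'] # raises KeyError

tag_title.get('class1',default="hi") # returns the default when the attribute is missing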
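
The three behaviours can be checked side by side. This is a minimal sketch with a cut-down version of the HTML above, using html.parser only so that it does not depend on lxml.

```python
from bs4 import BeautifulSoup

html = '<title class="t" id="ti">test site</title>'
tag_title = BeautifulSoup(html, "html.parser").title

missing = tag_title.get("class1")          # None, no exception
fallback = tag_title.get("class1", "hi")   # the default is returned instead

try:
    tag_title["class1"]                    # indexing a missing attribute
    raised = False
except KeyError:
    raised = True

print(missing, fallback, raised)
```

get is the safer choice when an attribute may be absent, and [] is useful when a missing attribute should be treated as an error.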

data_text = tag_title.text
data_text

data_text = tag_title.string
data_text

Both return the same value, but the types differ.

data_text = tag_title.text
data_string = tag_title.string
print("text : ",data_text, type(data_text))
print('string : ',data_string, type(data_string))

tag_p = soup.p
tag_p

data_text = tag_p.text
data_string = tag_p.string
print('text : ',data_text, type(data_text))
print('string : ',data_string, type(data_string))

html = """<html> <head><title>test site</title></head> <body> <p><span>test1</span><span>test2</span></p> </body></html>"""
soup = BeautifulSoup(html,'lxml')
tag_p = soup.p
tag_p

data_text = tag_p.text
data_string = tag_p.string
print('text : ',data_text, type(data_text))
print('string : ',data_string, type(data_string))
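
The last case is where text and string diverge. A compact sketch with the same markup, using html.parser only to avoid the lxml dependency:

```python
from bs4 import BeautifulSoup

html = "<p><span>test1</span><span>test2</span></p>"
tag_p = BeautifulSoup(html, "html.parser").p

joined = tag_p.text      # concatenates the text of every descendant
single = tag_p.string    # None, because <p> has more than one child

# Each span has exactly one string child, so .string works there
per_span = [str(span.string) for span in tag_p.find_all("span")]

print(repr(joined), single, per_span)
```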

You can use a conditional to check the data. The code below asks whether a span tag exists.

if tag_p.span is not None:
    print('found')

The contents and children attributes both give access to child tags.

contents returns the child tags as a list.

tag_p_children = soup.p.contents
print(tag_p_children)

tag_p_children = soup.p.children
tag_p_children # an iterator

Exercise) Print both using a loop

Example

a_tuple = (1,2,3)
b_iterator = iter(a_tuple)
print(b_iterator.__next__())
print(b_iterator.__next__())
print(b_iterator.__next__())

Now apply this:

tag_p_contents = soup.p.contents
tag_p_contents #[<span>test1</span>, <span>test2</span>]

tag_p_children = soup.p.children
tag_p_children # <list_iterator at 0x243f02ea220>

for i in tag_p_contents:
    print(i)

for i in tag_p_children:
    print(i)


[Python] bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

2021. 6. 8. 00:54


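While learning how to parse pages for web crawling, I ran into an error.

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

That was the message. I had assumed nothing extra needed installing, so it was confusing at first. In fact bs4 is a third-party package (beautifulsoup4), and the lxml parser is yet another separate package.

The message asks whether lxml is installed, so the fix is to install it.

pip install lxml

Even after running this and refreshing the Jupyter notebook page, the same error appeared.

Shutting Jupyter Notebook down completely and starting it again should make the error go away.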
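
One possible cause, offered as an assumption rather than a diagnosis, is that pip installed lxml into a different Python environment than the one the notebook kernel runs. The sketch below checks from inside the running interpreter whether lxml is importable and otherwise falls back to the built-in html.parser. pick_parser is a hypothetical helper written for this example.

```python
import importlib.util
import sys

def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that module is importable here, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback

parser_name = pick_parser()
print("interpreter:", sys.executable)
print("parser to pass to BeautifulSoup:", parser_name)
```

The returned name can then be passed as the second argument, as in BeautifulSoup(html, parser_name).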


[crawling] requests vs urllib / parsing module

2021. 6. 8. 00:44


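* This post was written in a Jupyter notebook.

requests vs urllib

1. requests and urllib differ in how the request object is built.

2. When sending data, requests accepts a dictionary, while urllib needs the data url-encoded and converted to bytes.

3. requests names the method explicitly (get, post), while urllib chooses GET or POST depending on whether data is present.

4. When a page does not exist, requests does not raise an error (the response simply carries the error status code), while urllib raises one.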
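
Point 3 can be observed without sending anything over the network, because a urllib Request object reports the method it would use. A small sketch with a placeholder URL:

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "https://example.com/"  # placeholder URL; no request is actually sent

get_req = Request(url)
post_req = Request(url, data=urlencode({"key1": "hong"}).encode("utf-8"))

print(get_req.get_method())   # chosen because data is absent
print(post_req.get_method())  # chosen because data is present
```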

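Parsing module

The HTML fetched by the request module has to be turned into something Python can work with. The bs4 module converts HTML into Python objects. bs4 is not part of the standard library, so install it with pip install beautifulsoup4.

Types of parsers

lxml
html5lib
html.parser

The plan is to cover the background knowledge, program installation, and syntax needed to build a crawler, working with requests + bs4 + selenium.

crawling (scraping content from pages)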

crawler

from bs4 import BeautifulSoup
html = """<p>test</p> """
soup=BeautifulSoup(html,"lxml")
soup

soup=BeautifulSoup(html,"html5lib")
soup

html = """<html> <head><title>test site</title> </head> <body> <p>test</p> </body></html>"""
soup=BeautifulSoup(html,"lxml")
soup

prettify() makes the result easier to read.

print(soup.prettify())

tag_title=soup.title
print(type(soup),',',type(tag_title))

tag_title.string
tag_title.text
# both give the same result

tag_title.name

html = """<html> <head><title class="t" id="ti">test site</title></head> <body> <p>test</p> <p>test1</p> <p>test2</p> </body></html>"""
soup = BeautifulSoup(html,"lxml") # parse the new html; otherwise soup still holds the previous document
tag_title = soup.title
print(tag_title.attrs)

print(tag_title['class'])

print(tag_title['id'])


[Crawling] Sending GET and POST requests when crawling

2021. 6. 7. 17:21


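* This post was written in a Jupyter notebook.

Building the data to send in a POST request

import urllib.parse
from urllib.request import urlopen, Request

url = "https://hello-ming.tistory.com/" # the same url used in the previous post

data = {"key1":"hong","key2":"icebear"}
data = urllib.parse.urlencode(data)
data = data.encode('utf-8')
data

Making a POST request

req_post = Request(url, data=data, headers={}) # 2nd argument: data, 3rd argument: headers

page = urlopen(req_post)
page

Making a GET request

req_get = Request(url+"?key1=values1&key2=values2", None, headers={}) # 2nd argument: data, 3rd argument: headers

page = urlopen(req_get)
print(page)

When building data, it must be encoded to bytes with the encode function before it is sent.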
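
The two encoding steps can be inspected separately, again without any network access. In this sketch parse_qs is used only to show that the bytes decode back to the original keys and values.

```python
from urllib.parse import urlencode, parse_qs

payload = {"key1": "hong", "key2": "icebear"}

query = urlencode(payload)     # a str in key=value&key=value form
body = query.encode("utf-8")   # bytes, the form urllib expects for data=

# Decode the body again to confirm that nothing was lost
decoded = parse_qs(body.decode("utf-8"))
print(query, body, decoded)
```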


[Crawling] How to send data

2021. 6. 7. 17:15


* This post was written in a Jupyter notebook.

import requests as rq

url = "https://hello-ming.tistory.com/"

res = rq.get(url,params={"key":"홍길동","key1":"홍말자","key2":"김개똥"})
res.url # Korean text shows up percent-encoded in the URL

res = rq.get(url,params={"key":"hong","key1":"malga","key2":"hi"})
res.url

url = "https://hello-ming.tistory.com/?key=hong&key1=malga"
res = rq.get(url)
res.url # same idea as above, but a hand-written query string is prone to typos
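
The Korean values are percent-encoded rather than lost. A standard-library sketch of the same kind of encoding, with a round trip to show that the original text comes back (the exact escaping requests applies is not reproduced here):

```python
from urllib.parse import urlencode, parse_qs

params = {"key": "홍길동", "key1": "hong"}

query = urlencode(params)    # non-ASCII text becomes %XX escapes of its UTF-8 bytes
restored = parse_qs(query)   # decoding brings the original text back

print(query)
print(restored)
```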

Sending data with POST

url = "https://hello-ming.tistory.com/"
res = rq.post(url, data={"key1":"hong","key2":"icebear"})
res.url # with POST the data travels in the request body, so it does not appear in the URL

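dict1 = {"key1":"hong","key2":"icebear"}

import json
json.dumps(dict1) # '{"key1": "hong", "key2": "icebear"}'
str(dict1) # "{'key1': 'hong', 'key2': 'icebear'}"

Both produce a string. The difference is that json.dumps keeps valid JSON (double quotes), while str gives the Python repr with single quotes.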
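
The difference matters as soon as the string has to be parsed again. A short sketch:

```python
import json

dict1 = {"key1": "hong", "key2": "icebear"}

as_json = json.dumps(dict1)
as_repr = str(dict1)

round_trip = json.loads(as_json)   # valid JSON parses back to the same dict

try:
    json.loads(as_repr)            # single-quoted repr is not valid JSON
    repr_is_json = True
except json.JSONDecodeError:
    repr_is_json = False

print(round_trip, repr_is_json)
```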

from urllib.request import urlopen,Request
req = Request(url)
page = urlopen(req)
page


Icebear's Development Diary