How to Use BeautifulSoup and Scrape Web Documents

Author: ParkYK | Posted: 15.08.24

References:

English documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

http://dplex.egloos.com/category/Python : BeautifulSoup examples

http://lxml.de : the lxml library, which can process large numbers of files

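--- Preparation before running BeautifulSoup ---

1) Two modules commonly used to read data from the web

Option a) the requests module: > pip install requests

requests documentation: https://3.python-requests.org/

Option b) the urllib module (standard library): https://docs.python.org/ko/3/library/urllib.request.html

2) Install BeautifulSoup: > pip install beautifulsoup4 (pip install bs4 also works)

3) Download the lxml file that matches your installed Python version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, extract it, and paste the resulting folder into ~\Lib\site-packages. On most current setups, pip install lxml is a simpler alternative.

If you installed Anaconda, the steps above are unnecessary.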

▣ Summary of the advantages and disadvantages of each parser ▣

Parser	Usage	Result on invalid markup	Advantages	Disadvantages
html.parser	BeautifulSoup(markup, "html.parser")	BeautifulSoup('<a></p>', 'html.parser') is repaired to <a></a>	Batteries included; decent speed; lenient (as of Python 2.7.3 and 3.2)	Slower than lxml; not very lenient on older Python (before 2.7.3 / 3.2.2)
lxml	BeautifulSoup(markup, "lxml")	BeautifulSoup('<a></p>', 'lxml') is repaired to <html><body><a></a></body></html>	Very fast; lenient	Depends on an external C library
xml	BeautifulSoup(markup, "xml")	BeautifulSoup('<a><b />', 'xml') is repaired to <?xml version="1.0" encoding="utf-8" ?> <a><b/></a>	Very fast; the only supported XML parser	Depends on an external C library

For HTML files, html.parser is the usual choice; lxml can be installed and used instead when speed matters.
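As a rough illustration of the table, the sketch below feeds the same invalid markup to html.parser and, if it happens to be installed, to lxml. The helper name parse_with and the constant BROKEN_MARKUP are made up for this example; FeatureNotFound is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

BROKEN_MARKUP = "<a></p>"  # a stray closing tag, as in the table


def parse_with(parser_name):
    """Return the repaired markup as a string, or None if the parser is missing."""
    try:
        return str(BeautifulSoup(BROKEN_MARKUP, parser_name))
    except FeatureNotFound:
        return None


# html.parser ships with Python; lxml is optional, so it may come back as None
results = {name: parse_with(name) for name in ("html.parser", "lxml")}
for name, repaired in results.items():
    print(name, "->", repaired if repaired is not None else "(not installed)")
```

Each parser repairs the markup differently, so the same selector can return different results depending on which parser is named.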

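* find functions provided by the BeautifulSoup module

- find() : returns the first match, or None if nothing matches

- find_next() : returns the first match that comes after the current element

- find_all() : returns every match as a list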
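The difference between the three functions is easiest to see on a tiny document. This is a minimal sketch; the markup and the id values are invented for illustration.

```python
from bs4 import BeautifulSoup

html = '<p id="first">one</p><p id="second">two</p><p id="third">three</p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")                # first matching tag
following = first.find_next("p")      # next match after `first` in document order
all_ids = [p["id"] for p in soup.find_all("p")]  # every match
missing = soup.find("table")          # no match, so None

print(first["id"], following["id"], all_ids, missing)
```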

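1) Find all a tags

soup.find_all("a")

soup("a") # shorthand: calling the object is the same as find_all()

2) Find all strings inside the title tag

soup.title.find_all(string=True)

soup.title(string=True)

3) Get only the first two p tags

soup.find_all("p", limit=2)

4) Search by string

soup.find_all(string="Tom") # strings exactly equal to "Tom"

soup.find_all(string=["Tom", "Elsa", "Oscar"]) # match any of the listed strings (OR)

soup.find_all(string=re.compile(r"\d\d")) # regular expression (requires import re)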
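The three forms of string search in item 4 behave differently: a plain string must match exactly, a list matches any member, and a compiled pattern is searched within each string. A small sketch with invented list items:

```python
import re

from bs4 import BeautifulSoup

html = "<ul><li>Tom</li><li>Elsa</li><li>Tommy</li><li>Room 42</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Exact match only: "Tommy" is not returned
exact = [str(s) for s in soup.find_all(string="Tom")]

# A list works as an OR over exact matches
either = [str(s) for s in soup.find_all(string=["Tom", "Elsa"])]

# A compiled pattern matches anywhere inside the string
two_digits = [str(s) for s in soup.find_all(string=re.compile(r"\d\d"))]

print(exact, either, two_digits)
```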

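5) p tags whose CSS class is title (the second positional argument matches the class attribute)

soup.find_all("p", "title")

Example) <p class="title">Story</p> is matched.

6) Find both a tags and b tags

soup.find_all(["a", "b"])

7) Get an attribute value

soup.p['class']

soup.p['id']

8) Replace a string with another string

tag.string.replace_with("new value")

9) Pretty-print

soup.b.prettify()

10) Simple navigation

soup.body.b # the first b tag under the body tag

soup.a # the first a tag

11) Get all attributes (as a dict)

tag.attrs

12) class is a reserved word in Python, so write class_ instead.

soup.find_all("a", class_="sister")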
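Class matching has one detail worth seeing in code: a tag with several classes is matched by any one of them, and its class attribute comes back as a list. The markup below is a shortened, hypothetical version of the "three sisters" sample.

```python
from bs4 import BeautifulSoup

html = '''
<p class="title">Story</p>
<a class="sister" id="link1">Elsie</a>
<a class="sister" id="link2">Lacie</a>
<a class="sister brother" id="link3">Tillie</a>
'''
soup = BeautifulSoup(html, "html.parser")

# class_ matches any single class of a multi-class tag
sisters = [a["id"] for a in soup.find_all("a", class_="sister")]
brothers = [a["id"] for a in soup.find_all("a", class_="brother")]

# A second positional argument is treated as a CSS class as well
titles = [p.get_text() for p in soup.find_all("p", "title")]

# With html.parser, class is a multi-valued attribute and is returned as a list
tillie_classes = soup.find(id="link3")["class"]

print(sisters, brothers, titles, tillie_classes)
```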

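13) Check the result of find before using it (find returns None when nothing matches)

if soup.find("div", title=True) is not None:

    i = soup.find("div", title=True)

14) Find by an attribute that starts with data- (a hyphenated name cannot be a keyword argument, so pass attrs)

soup.find("div", attrs={"data-value": True})

15) Get the tag name

soup.find("div").name

16) Get an attribute

soup.find("div")['class'] # raises KeyError if the attribute is missing

soup.find("div").get('class') # returns None if the attribute is missing

17) Check whether an attribute exists

tag.has_attr('class')

tag.has_attr('id') # True if present, False otherwise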
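Items 13 to 17 can be tried together on one small document. This is only a sketch: the div elements and their attribute values are invented, and the variable names are arbitrary.

```python
from bs4 import BeautifulSoup

html = '<div id="box" data-value="7">A</div><div title="tip">B</div>'
soup = BeautifulSoup(html, "html.parser")

# attrs= is needed for data-value because of the hyphen in the name
box = soup.find("div", attrs={"data-value": True})
tipped = soup.find("div", title=True)

name = box.name                 # tag name
attrs = box.attrs               # all attributes as a dict
safe = box.get("class")         # missing attribute: None
has_id = box.has_attr("id")     # True

# Subscript access on a missing attribute raises KeyError
try:
    box["class"]
    raised = False
except KeyError:
    raised = True

print(name, attrs, safe, has_id, raised, tipped.get_text())
```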

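18) Remove a tag but keep its contents

a_tag.img.unwrap() # use decompose() instead to delete the tag together with its contents

19) Wrap content in a new tag

soup.p.string.wrap(soup.new_tag("b"))

soup.p.wrap(soup.new_tag("div"))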
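The tree-editing calls in items 18 and 19 are easier to follow with before-and-after markup. The sketch below also shows decompose(), which is not used in the list above but is the usual way to delete a tag along with everything inside it; the sample markup is illustrative.

```python
from bs4 import BeautifulSoup

# unwrap(): drop the <i> tag itself but keep its text
soup = BeautifulSoup(
    '<a href="http://example.com/">I linked to <i>example.com</i></a>',
    "html.parser",
)
soup.a.i.unwrap()
unwrapped = str(soup.a)

# decompose(): delete the tag together with its contents
soup = BeautifulSoup('<p>keep<img src="x.png"/></p>', "html.parser")
soup.img.decompose()
decomposed = str(soup.p)

# wrap(): put a string, then the whole <p>, inside newly created tags
soup = BeautifulSoup("<p>I wish I was bold.</p>", "html.parser")
soup.p.string.wrap(soup.new_tag("b"))
soup.p.wrap(soup.new_tag("div"))
wrapped = str(soup)

print(unwrapped)
print(decomposed)
print(wrapped)
```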

* select functions in BeautifulSoup

They take the same syntax as CSS selectors.

1) select_one() : returns only the first match (or None if nothing matches)

2) select() : returns all matches as a list
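A short sketch of both functions with a few common selector forms. The wrapping div and its id are assumptions added for this example; the links follow the "three sisters" sample.

```python
from bs4 import BeautifulSoup

html = '''
<div id="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister brother" id="link3">Tillie</a>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list, even for zero or one match
names = [a.get_text() for a in soup.select("div#story > a.sister")]
tillie = soup.select('a[href$="/tillie"]')

# select_one() returns a single tag, or None when nothing matches
lacie = soup.select_one("#link2")
nothing = soup.select_one("table")

print(names, len(tillie), lacie.get_text(), nothing)
```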

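There are two ways to get the text inside a tag:

* string : .string returns the string directly under the tag as an object, or None if there is none.

Note: the tag must have exactly one child. If it contains several children (for example text mixed with another tag), .string is None.

* text or get_text() : .text returns the tag's text including that of its descendant tags, joined into one (Unicode) string.

So when you also need the text of child tags, use .text.

When there is no string, .string gives None, whereas get_text() returns an empty string, so nothing is printed.

- get_text() is the function that outputs only the text, with the tags stripped.

- .string gives the same result as get_text() only when the tag has a single child.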
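The rules for .string and get_text() can be checked on a few hand-made tags. A minimal sketch, with markup chosen to cover the single-child, mixed-content, and empty cases:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<p>Hello <b>world</b></p><h1><b>Title</b></h1><div></div>",
    "html.parser",
)

only_string = soup.b.string        # one string child: that string
mixed_string = soup.p.string       # text plus a tag: None
mixed_text = soup.p.get_text()     # text of all descendants, joined
nested_string = soup.h1.string     # a single child tag is followed down to its string
empty_string = soup.div.string     # no children: None
empty_text = soup.div.get_text()   # no children: empty string

print(only_string, mixed_string, mixed_text, nested_string, empty_string, repr(empty_text))
```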

** The source code below worked as of November 2021. Keep in mind that page contents keep changing. **

Simple example 1)

import requests
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def go():
    base_url = "http://www.naver.com:80/index.html"

    # Fetch the page; the response object also carries the headers and status
    response = requests.get(base_url)

    # Keep only the decoded body text
    plain_text = response.text

    # Parse the text into a BeautifulSoup object so it can be searched
    convert_data = BeautifulSoup(plain_text, 'lxml')

    # href=True skips a tags that have no href attribute
    for link in convert_data.find_all('a', href=True):
        href = urljoin(base_url, link['href'])  # build an absolute, clickable URL
        print(href)

go()

Output
http://www.naver.com/index.html#newsstand
http://www.naver.com/index.html#themecast
http://www.naver.com/index.html#timesquare
...
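Because example 1 depends on a live page, the link-building step can also be tried offline. The sketch below uses invented markup and a base URL without the port; it shows why urljoin is a safer way to build absolute URLs than string concatenation, since absolute and root-relative hrefs are handled correctly and tags without href are skipped.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.naver.com/index.html"  # assumed base for this sketch

# Hypothetical markup standing in for a downloaded page
html = '''
<a href="#newsstand">Newsstand</a>
<a href="/news">News</a>
<a href="https://example.com/page">External</a>
<a name="no-href">No link</a>
'''
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the a tags that actually carry an href
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]

for url in links:
    print(url)
```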

Simple example 2) Using regular expressions

from bs4 import BeautifulSoup

import re

html = '''<!DOCTYPE html>

<html>

<head><title>story</title></head>

<body>

BeautifulSoup Test

Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,

<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and

<a href="http://example.com/tillie" class="sister brother" id="link3">Tillie</a>;

and they lived at the bottom of a well.

</body>

</html>

'''

soup = BeautifulSoup(html, 'html.parser')

ele = soup.find('a', {'href':re.compile('.*/lacie')})

print(ele) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

ele = soup.find(href=re.compile('.*/lacie'))

print(ele) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

ele = soup.find('a', {'href':lambda val: val and 'lacie' in val})

print(ele) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

ele = soup.find(href=lambda val: val and 'lacie' in val)

print(ele) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
