Fetching Namuwiki Data with Python (Pandas, datasets, parquet)

뉴비뉴 2023. 6. 23.

Hello.

Today we will extract the '식품 관련 정보' (food-related information) data from Namuwiki.

I used https://huggingface.co/datasets/heegyu/namuwiki as a reference.

Install datasets, then download the namuwiki data, which is distributed as a parquet file.

$ pip install datasets

from datasets import load_dataset

# Download the namuwiki data from the Hugging Face Hub
dataset = load_dataset("heegyu/namuwiki")
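
If you would rather work with the loaded data directly in pandas, the dataset object can be converted to a DataFrame. This is a minimal sketch, not the author's method: the helper name load_namuwiki_dataframe is hypothetical, and it assumes the data is exposed under a split named "train" (check the dataset page to confirm).

```python
def load_namuwiki_dataframe(split="train"):
    """Download heegyu/namuwiki and return one split as a pandas DataFrame.

    The split name "train" is an assumption; adjust it if the dataset
    page lists a different one.
    """
    # Imported lazily so that defining this helper does not require the
    # package; calling it needs `pip install datasets` and network access.
    from datasets import load_dataset

    dataset = load_dataset("heegyu/namuwiki")
    # Dataset.to_pandas() materializes the whole split in memory.
    return dataset[split].to_pandas()
```

Calling df = load_namuwiki_dataframe() triggers the download on first use, so expect it to take a while and to need enough memory for the full dump.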

The dataset contains a huge number of documents; you can search it for the data you want as shown below.

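import pandas as pd

# Load the downloaded parquet file into a DataFrame
df = pd.read_parquet("/Users/user/Downloads/namuwiki_20210301.parquet")
# Keep rows whose 'text' column contains the phrase; na=False skips rows with missing text
filtered_df = df[df['text'].str.contains("식품 관련 정보", na=False)]
# Save the matching text to a CSV file
filtered_df['text'].to_csv("/Users/user/Downloads/info.csv")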
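
To check the filtering logic without loading the full parquet file, the same substring search can be tried on a tiny DataFrame. This is a minimal sketch: the helper filter_by_keyword and the sample rows are hypothetical, and it assumes the document body lives in a 'text' column, as in the code above.

```python
import pandas as pd


def filter_by_keyword(df, keyword, column="text"):
    """Return the rows whose `column` contains `keyword` as a literal substring."""
    # regex=False treats the keyword as plain text rather than a pattern;
    # na=False counts rows with missing text as non-matching.
    mask = df[column].str.contains(keyword, regex=False, na=False)
    return df[mask]


# Hypothetical sample rows standing in for the real Namuwiki data.
sample = pd.DataFrame(
    {
        "title": ["Kimchi", "Seoul", "Empty"],
        "text": ["... 식품 관련 정보 ...", "Capital of South Korea", None],
    }
)

result = filter_by_keyword(sample, "식품 관련 정보")
print(result["title"].tolist())  # ['Kimchi']
```

Passing na=False matters here because a missing value in the column would otherwise leave a gap in the boolean mask, and pandas refuses to index with such a mask.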

The resulting CSV file then contains the matching data.

Thank you.

References

https://huggingface.co/datasets/heegyu/namuwiki
