Basic Knowledge of Wikipedia Data Scraping using Python
- import requests: This line imports the requests library, commonly used for making HTTP requests in Python.
- endpoint = "https://en.wikipedia.org/w/api.php": This sets the endpoint URL for Wikipedia’s API.
- params = {...}: This dictionary contains parameters sent with the API request, specifying the action as “query” and the format as JSON.
- response = requests.get(endpoint, params=params): This line sends a GET request to the Wikipedia API with the specified parameters and stores the response in the response variable.
- data = response.json(): Converts the JSON response from the API into a Python dictionary.
- page_data = data["query"]["pages"]: Extracts relevant information about the page from the API response.

Now you can explore advanced options.
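The steps above can be wrapped in one small helper. The sketch below is an illustration, not part of the original scripts: the function name query_wikipedia, the http_get parameter, and the User-Agent string are assumptions made here. It adds a timeout, an HTTP status check, and a check for the API's own error object, which can arrive inside an otherwise successful JSON response.

```python
ENDPOINT = "https://en.wikipedia.org/w/api.php"
# Hypothetical identifier; Wikimedia asks API clients to send a descriptive User-Agent.
HEADERS = {"User-Agent": "wiki-tutorial-example/0.1 (contact: you@example.com)"}


def query_wikipedia(params, http_get):
    """Send one API request and return the decoded JSON as a dict.

    http_get is any callable with the same interface as requests.get;
    passing it in keeps the helper easy to test without network access.
    """
    response = http_get(ENDPOINT, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on HTTP errors such as 404 or 500
    data = response.json()
    if "error" in data:
        # API-level failures arrive as a JSON "error" object.
        raise RuntimeError(data["error"].get("info", "unknown API error"))
    return data
```

With requests installed, query_wikipedia(params, requests.get) should return the same kind of dictionary as response.json() in the scripts that follow.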
Fetching Contents from a Wikipedia Page using Wikipedia API
This Python script uses the requests library to retrieve the introductory extract of the “2023 Cricket World Cup” page from Wikipedia’s API.
Code
import requests
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "exintro": True,      # return only the section before the first heading
    "explaintext": True,  # return plain text instead of HTML
    "titles": "2023_Cricket_World_Cup"
}
response = requests.get(endpoint, params=params)
data = response.json()
page_data = data["query"]["pages"]
page_id = list(page_data.keys())[0]
page_info = page_data[page_id]
page_title = page_info["title"]
page_extract = page_info["extract"]
print(f"Title: {page_title}")
print(f"Extract:\n{page_extract}")
Output
Title: 2023 Cricket World Cup
Extract:
The 2023 Cricket World Cup, officially known as the 2023 ICC Men's Cricket World Cup, was the 13th edition of the Cricket World Cup. It started on 5 October and concluded on 19 November 2023, with Australia winning the tournament. A quadrennial One Day International (ODI) cricket tournament contested by national teams, it was organised by the International Cricket Council (ICC). Ten national teams participated in the tournament. It was the first men's Cricket World Cup which India hosted solely. The tournament took place in ten different stadiums, in ten cities across the country. In the first semi-final India beat New Zealand, and in the second semi-final Australia beat South Africa. The final took place between India and Australia at Narendra Modi Stadium on 19 November with Australia winning the title for the sixth time.
The top eight placed teams in the tournament's final points table qualified for the 2025 ICC Champions Trophy, the next ICC ODI tournament. Virat Kohli was the player of the tournament and also scored the most runs; Mohammed Shami was the leading wicket-taker. A total of 1,250,307 spectators attended matches, the highest number in any cricket World Cup to date.
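The script above assumes the page exists. For a title that does not exist, the API still returns an entry under pages, but that entry is marked with a missing key and has no extract, so page_info["extract"] raises a KeyError. A minimal sketch of a safer lookup follows; the function name first_page_extract is an assumption of this example, and it works on the data dictionary produced by response.json().

```python
def first_page_extract(data):
    """Return (title, extract) for the first page in an API response.

    extract is None when the page is missing or has no extract.
    """
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        return None, None
    page_info = next(iter(pages.values()))
    if "missing" in page_info:
        return page_info.get("title"), None
    return page_info.get("title"), page_info.get("extract")


# Shape of a response for a title that does not exist (illustrative sample).
missing = {"query": {"pages": {"-1": {"ns": 0, "title": "No such page", "missing": ""}}}}
print(first_page_extract(missing))  # ('No such page', None)
```

Returning None instead of raising lets the caller decide whether a missing page is an error or just an empty result.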
Fetching Categories of a Wikipedia Page using Wikipedia API
This Python script fetches category information for the “2023_Cricket_World_Cup” page using the Wikipedia API.
Code
import requests
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "categories",
    "titles": "2023_Cricket_World_Cup"
}
response = requests.get(endpoint, params=params)
data = response.json()
page_data = data["query"]["pages"]
page_id = list(page_data.keys())[0]
page_info = page_data[page_id]
categories = page_info.get("categories", [])
category_names = [category["title"] for category in categories]
print(f"Categories: {', '.join(category_names)}")
Output
Categories: Category:2023 Cricket World Cup, Category:2023 in Indian sport, Category:2023 in cricket, Category:Articles with excerpts, Category:Articles with short description, Category:Cricket World Cup tournaments, Category:Cricket events postponed due to the COVID-19 pandemic, Category:International cricket competitions in India, Category:November 2023 sports events in India, Category:October 2023 sports events in India
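The output lists exactly ten categories, which appears to match the API's default batch size for prop=categories, so the page may belong to more. The API signals remaining results with a continue object whose values are merged into the next request. The sketch below follows that pattern. cllimit and clshow are API parameters; the function name and the fetch callable (a stand-in for requests.get(endpoint, params=params).json()) are assumptions of this example.

```python
def fetch_all_categories(title, fetch):
    """Collect every category of a page, following API continuation.

    fetch is a hypothetical callable that takes a params dict and returns
    the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "cllimit": "max",     # request the largest batch the API allows
        "clshow": "!hidden",  # skip hidden maintenance categories
    }
    names = []
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", {}).values():
            names.extend(cat["title"] for cat in page.get("categories", []))
        if "continue" not in data:
            return names
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}
```

One way to call it with requests would be fetch_all_categories("2023_Cricket_World_Cup", lambda p: requests.get(endpoint, params=p).json()).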
Fetching Pageviews of Wikipedia Page using Wikipedia API
This Python script fetches daily pageview data for the “Lakshadweep” page using the Wikipedia API.
Code
import requests
endpoint = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "pageviews",
    "titles": "Lakshadweep",
    "pvipdays": 10  # Number of days for which you want pageview data
}
response = requests.get(endpoint, params=params)
data = response.json()
page_data = data["query"]["pages"]
page_id = list(page_data.keys())[0]
page_info = page_data[page_id]
page_title = page_info["title"]
page_views = page_info["pageviews"]
print(f"Title: {page_title}")
print("Pageviews:")
for date, views in page_views.items():
    print(f"{date}: {views}")
Output
Title: Lakshadweep
Pageviews:
2024-01-13: 34439
2024-01-14: 29359
2024-01-15: 22904
2024-01-16: 17460
2024-01-17: 12851
2024-01-18: 12598
2024-01-19: 9729
2024-01-20: 8068
2024-01-21: 8234
2024-01-22: 6569
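Once the daily counts are in a dictionary, simple summaries are easy to compute. The sketch below totals the views and finds the busiest day. The function name summarize_pageviews is an assumption of this example, and the None check is there because the API may report no value for a day without data. The sample uses the figures from the output above.

```python
def summarize_pageviews(page_views):
    """Return (total_views, peak_date, peak_views) for a date -> views dict."""
    # Drop days for which no count was reported.
    counted = {day: views for day, views in page_views.items() if views is not None}
    if not counted:
        return 0, None, None
    peak_date = max(counted, key=counted.get)
    return sum(counted.values()), peak_date, counted[peak_date]


sample = {
    "2024-01-13": 34439, "2024-01-14": 29359, "2024-01-15": 22904,
    "2024-01-16": 17460, "2024-01-17": 12851, "2024-01-18": 12598,
    "2024-01-19": 9729, "2024-01-20": 8068, "2024-01-21": 8234,
    "2024-01-22": 6569,
}
total, peak_date, peak_views = summarize_pageviews(sample)
print(f"Total: {total}, peak: {peak_date} ({peak_views})")
# Total: 162211, peak: 2024-01-13 (34439)
```

In the pageviews script, the same function could be applied directly to page_views.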
Fetching Associated Wikipedia Pages using Wikipedia API
This Python script searches Wikipedia for pages related to “domestic violence”.
Code
import requests
endpoint = "https://en.wikipedia.org/w/api.php"
search_term = "domestic violence"
params = {
    "action": "query",
    "format": "json",
    "list": "search",
    "srsearch": search_term
}
response = requests.get(endpoint, params=params)
data = response.json()
search_results = data["query"]["search"]
print("Wikipedia pages related to 'domestic violence':")
for result in search_results:
    page_title = result["title"]
    print(page_title)
Output
Wikipedia pages related to 'domestic violence':
Domestic violence
Islam and domestic violence
Domestic violence in lesbian relationships
Domestic violence against men
Domestic violence in India
Christianity and domestic violence
Domestic violence in Russia
Domestic violence in same-sex relationships
Domestic violence in China
Epidemiology of domestic violence
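Each search result carries more than a title. Fields such as pageid, wordcount, and snippet are also returned, and the snippet contains HTML markup around the matched words. The sketch below tidies those fields; the function name and the sample record are assumptions of this example. The ten titles above reflect the default result count, and the srlimit parameter can be added to params to request more.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def tidy_search_results(search_results):
    """Reduce raw search results to title, page ID, word count, and plain-text snippet."""
    tidy = []
    for result in search_results:
        # Strip HTML tags, then decode entities such as &amp;.
        snippet = html.unescape(TAG_PATTERN.sub("", result.get("snippet", "")))
        tidy.append({
            "title": result["title"],
            "pageid": result.get("pageid"),
            "wordcount": result.get("wordcount"),
            "snippet": snippet.strip(),
        })
    return tidy


# Illustrative record in the shape the search list returns (values are made up).
sample = [{
    "title": "Domestic violence",
    "pageid": 1,
    "wordcount": 100,
    "snippet": '<span class="searchmatch">Domestic</span> '
               '<span class="searchmatch">violence</span> in a home &amp; family setting',
}]
print(tidy_search_results(sample)[0]["snippet"])
# Domestic violence in a home & family setting
```

In the search script, tidy_search_results(search_results) could replace the loop that prints only the titles.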