Introduction to Web Scraping with Python

6 minutes
1 year, 1 month ago
<h2><b>Overview</b></h2><div><p>1. Extraction of data from websites</p><p>2. Introduction to BeautifulSoup</p></div>
<h2><b>Introduction</b></h2>
<p>There is more information on the Internet than any human can absorb in a lifetime. What you need is not access to that information, but a scalable way to collect, organize, and analyze it.</p>
<p>The way to do that is called "web scraping", and Python lets you do it easily and flexibly.</p>
<p>In this tutorial, we'll learn how to scrape a basic web page, but the lessons learned will help you scrape almost any website out there.</p>
<h2><b class="">Pre-requisites</b></h2>
<p>Make sure you have Python installed on your system. We'll be using Python 3 for the purposes of this tutorial, but Python 2 should work with minimal changes.</p>
<p>We'll also need two external libraries for this tutorial: requests and BeautifulSoup. The code below also passes "lxml" to BeautifulSoup as its parser, and lxml has to be installed separately.</p>
<p>Install them using the commands:</p>
<pre>pip install requests<br>pip install beautifulsoup4<br>pip install lxml</pre>
<p>The basic steps to scrape a website are:</p>
<p>1. Get the source of the web page. The 'requests' library allows us to do that</p>
<p>2. Use the inspect feature of your web browser to find the element that contains the data you require</p>
<p>3. Extract the contents using BeautifulSoup</p>
<p>Let's scrape an example web page: http://www.moneycontrol.com/indian-indices/nifty-50-9.html</p>
<h2><b>Content</b></h2>
<h3><b>Let's write some code!</b></h3>
<p>Once you have the necessary libraries installed, let's import them in our Python script.</p>
<pre>import requests<br>from bs4 import BeautifulSoup</pre>
<p>Now, let's get the source of the web page.</p>
<pre>url = "http://www.moneycontrol.com/indian-indices/nifty-50-9.html"<br>r = requests.get(url)</pre>
<p>Next we'll convert this raw HTML into a BeautifulSoup object. This will allow us to easily search for divs, spans, etc. 
in the page source.</p>
<pre>soup = BeautifulSoup(r.text, "lxml")</pre>
<p>If you open the page in the browser and inspect it, you'll see that the content we want is inside a <b>'div'</b> with <b>'class=FL gr_35'</b>.</p>
<p>We'll use the tag name and the class to extract the data inside this div.</p>
<pre>div = soup.find('div', {'class' : 'FL gr_35'})</pre>
<p>If you print the contents of 'div', you'll see the same markup that your browser's inspector shows for that element.</p>
<p>Finally, you can extract the Nifty 50 price as follows:</p>
<pre>strong_tag = div.strong<br>nifty_price = strong_tag.text</pre>
<p><img src="https://i.imgur.com/SZqADuH.png"></p>
<h3><b>Conclusion</b></h3>
<p>And there you have it!</p>
<p>A basic introduction to web scraping with Python. Remember, it is the combination of the three steps described above that will help you scrape data from all kinds of web pages, simple or complex.</p>
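<p>As a rough sketch, the fetch and extract steps above can be wrapped in two small functions. The function names, the timeout value, and the sample HTML string are illustrative assumptions, not part of the tutorial's code. The sketch also swaps in Python's built-in "html.parser" in place of "lxml", so the parsing step runs without the extra install, and it returns None if the div or its strong tag is missing, which can happen if the site has changed its markup since this was written.</p>

```python
import requests
from bs4 import BeautifulSoup

# URL from the tutorial; the page layout may have changed since it was written.
URL = "http://www.moneycontrol.com/indian-indices/nifty-50-9.html"


def fetch_html(url, timeout=10):
    """Step 1: download the page source, failing loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text


def extract_price(html):
    """Step 3: return the text of the <strong> tag inside the price div, or None."""
    # Built-in parser used here instead of lxml (an assumption of this sketch).
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"class": "FL gr_35"})
    if div is None or div.strong is None:
        return None  # the expected markup was not found
    return div.strong.text.strip()


# Hypothetical snippet standing in for the real page, so the parsing step
# can be tried without a network connection.
SAMPLE_HTML = '<div class="FL gr_35"><strong>10,000.50</strong></div>'

if __name__ == "__main__":
    print(extract_price(SAMPLE_HTML))
    # To try the live page instead (requires network access):
    # print(extract_price(fetch_html(URL)))
```

<p>Keeping the download and the parsing in separate functions means the parsing logic can be checked against a saved or hand-written HTML string before it is pointed at the live site.</p>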
