Reading website content with Python

Snippet: Fetch and process a website using Python

With Python, web content such as Internet pages can be fetched and its content read. In this way, websites can be monitored for availability or for specific events.

urllib

The urllib module is suitable for such tasks in Python. It allows resources to be fetched from the Internet or a local network. urllib is part of the Python standard library, so no additional packages need to be installed.
A normal import makes the module available:

import urllib

To call up a page you can use urllib as follows:

response = urllib.urlopen('http://city-insider.de')

The request is sent to the URL http://city-insider.de, and the answer is stored in the variable response. With a simple loop you can print the HTML code to the screen:

for line in response:
    print line
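The snippets in this article use Python 2 syntax (the print statement and urllib.urlopen). On Python 3, urlopen lives in urllib.request and the response yields bytes rather than text. The following is a minimal sketch of the same idea under that assumption; the helper names fetch_lines and page_contains, the timeout value, and the UTF-8 fallback are illustrative choices and do not come from the original.

```python
from urllib.request import urlopen


def fetch_lines(url, timeout=10):
    """Return the decoded lines of the resource at `url` (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # The body arrives as bytes; fall back to UTF-8 (an assumption)
        # if the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return [raw.decode(charset, errors="replace").rstrip("\r\n")
                for raw in response]


def page_contains(url, needle, timeout=10):
    """Monitoring-style check: True if the page loads and contains `needle`."""
    try:
        return any(needle in line for line in fetch_lines(url, timeout))
    except OSError:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        return False
```

Printing a page line by line then becomes a loop over fetch_lines('http://city-insider.de'), and page_contains could serve as a starting point for the kind of monitoring mentioned at the beginning.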

To make the whole thing clearer, I have prepared a small practical example:

Problem: You want to read out the title (title tag) of a website automatically.
Solution: A function is required.
Parameter: Page URL
Return value: Title tag

import urllib

def website_title(url):
    response = urllib.urlopen(url)
    for l in response:
        # Look for the line that contains the opening title tag
        if l.find('<title>') >= 0:
            l = l.strip()
            # 7 is the length of '<title>'
            return l[l.find('<title>') + 7:l.find('</title>')]

if __name__ == '__main__':
    print website_title('http://city-insider.de')

This is a very simple example, because it assumes that the title tag appears in the HTML exactly as expected: in lowercase, without attributes, and with the opening and closing tag on the same line. Since other spellings of the title tag are possible, this function will not work for every URL. However, I think the example is sufficient as a rough illustration of the subject.
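One way to loosen these restrictions without leaving the standard library is to let an HTML parser find the title instead of searching for a fixed string. The sketch below assumes Python 3 and uses html.parser; the class name TitleParser, the helper title_from_html, and the UTF-8 fallback are illustrative and not part of the original article.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase, so <TITLE> matches too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def title_from_html(html):
    """Return the stripped title text, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip() or None


def website_title(url):
    with urlopen(url) as response:
        # Fall back to UTF-8 (an assumption) if no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return title_from_html(html)
```

Because the parser works on the whole document, a title written in uppercase, carrying attributes, or spread over several lines is handled the same way as the simple case.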

mechanize

Another, somewhat more convenient way of calling up websites with Python is the mechanize module. mechanize is not part of the Python standard library. However, it is listed in the Python Package Index (PyPI), so the package can be installed using setuptools. Once mechanize is installed on the system, it can be used in Python just like urllib.

mechanize is used in a similar way to urllib. It must also be imported first:

import mechanize

Then a browser must be instantiated in mechanize:

br = mechanize.Browser()

A website can then be accessed via open(). The URL should include the scheme (http://):

br.open('http://city-insider.de')

The website can also be displayed line by line in the console with mechanize:

response = br.open(url)
for l in response.readlines():
    print l

The title of the website could now be read out in the same way as in urllib. However, mechanize already has its own method on board that does this for us:

import mechanize

def website_title2(url):
    br = mechanize.Browser()
    br.open(url)
    return br.title()

if __name__ == '__main__':
    print website_title2('http://city-insider.de')

The result of both functions is the same. The mechanize variant is probably much more robust, as it works regardless of how the title tag is spelled. It also means less effort for the developer, as the result does not have to be laboriously parsed. mechanize additionally offers options for filling out and submitting forms, and links can be followed as well. In theory, entire click paths on a website can be conveniently scripted. I may write another article about the other functions in mechanize to go into a little more detail. This should be enough for a first overview.

Tags: retrieve, read out, function, mechanize, package, pypi, python, setuptools, snippet, urllib, website