Reading website content with Python
Snippet: Fetching and processing a website with Python
Python can retrieve web content such as Internet pages and read it. In this way, websites can be monitored for availability or for specific events.
The urllib module is well suited to such tasks. It retrieves resources from the Internet or a local network. urllib is part of the Python standard library, so no additional packages need to be installed. In Python 3 the functions for sending requests live in the submodule urllib.request.
The module is loaded with an ordinary import statement.
To retrieve a page, pass its URL to the urlopen function.
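A minimal sketch of these two steps, assuming Python 3, where urlopen is imported from urllib.request (older Python 2 code called urllib.urlopen directly). The helper name open_page and the timeout value are illustrative assumptions and not part of the original snippet.

```python
from urllib.request import urlopen  # Python 3 location of urlopen


def open_page(url, timeout=10):
    """Send a GET request to url and return the file-like response object.

    The name open_page and the 10-second timeout are illustrative choices.
    """
    return urlopen(url, timeout=timeout)


# Example call (needs network access):
# response = open_page("http://city-insider.de")
```

The returned object behaves like a file opened in binary mode, so read() delivers bytes that still have to be decoded before they can be treated as text.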
In the example the request goes to the URL http://city-insider.de, and the answer is stored in the variable response. A simple loop over response then prints the HTML source on the screen.
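Such a loop might look like the following sketch. It assumes Python 3, where iterating over the response yields one line of bytes at a time. The helper names and the UTF-8 default are assumptions, and an in-memory io.BytesIO object stands in for a real response so that the example runs without a network connection.

```python
import io


def iter_html_lines(response, encoding="utf-8"):
    """Yield the lines of a file-like response as decoded text."""
    for raw_line in response:
        # Each line arrives as bytes and still ends with its line break.
        yield raw_line.decode(encoding, errors="replace").rstrip("\r\n")


def print_html(response, encoding="utf-8"):
    """Print the HTML source line by line."""
    for line in iter_html_lines(response, encoding):
        print(line)


# Stand-in for the object returned by urllib.request.urlopen().
fake_response = io.BytesIO(b"<html>\n<head><title>Demo</title></head>\n</html>\n")
print_html(fake_response)
```

With a real page, the object returned by urlopen would be passed to print_html instead of fake_response.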
To make this more concrete, I have prepared a small practical example.
Problem: The title (title tag) of a website should be read out automatically.
Solution: A function is required.
Parameter: URL of the page
Return value: Content of the title tag
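A function of this kind could be sketched as follows, assuming Python 3 and a plain string search for the literal tags. The names extract_title and get_title, the timeout, and the UTF-8 decoding are assumptions made for illustration.

```python
from urllib.request import urlopen


def extract_title(html):
    """Return the text between a literal <title> and </title>, or None."""
    start = html.find("<title>")
    if start == -1:
        return None
    start += len("<title>")
    end = html.find("</title>", start)
    if end == -1:
        return None
    return html[start:end].strip()


def get_title(url, timeout=10):
    """Fetch url and return the content of its title tag (or None)."""
    with urlopen(url, timeout=timeout) as response:
        # Assumes the page is UTF-8; undecodable bytes are replaced.
        html = response.read().decode("utf-8", errors="replace")
    return extract_title(html)


# Example call (needs network access):
# print(get_title("http://city-insider.de"))
```

Splitting the search into its own function keeps the parsing step usable on any HTML string, independent of how the page was downloaded.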
This is a very simple example, because it assumes that the title tag appears in the HTML in exactly the form the function searches for. Since other spellings of the tag are possible, for example with uppercase letters or additional attributes, the function will not work for every URL. However, I think the example is sufficient as a rough illustration of the subject.
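A more tolerant variant can be sketched with the html.parser module from the standard library, which reports tag names in lowercase regardless of how they are written in the page. This is only an illustration and not the function from the original snippet; TitleParser and extract_title_tolerant are made-up names.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first title element in a document."""

    def __init__(self):
        super().__init__()
        self.title = None      # set once the closing title tag is seen
        self._inside = False   # True while parsing inside <title>...</title>
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive in lowercase; attributes are simply ignored.
        if tag == "title" and self.title is None:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._inside:
            self._inside = False
            self.title = "".join(self._parts).strip()


def extract_title_tolerant(html):
    """Return the title text of an HTML string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


print(extract_title_tolerant('<HTML><HEAD><TITLE lang="en"> Demo </TITLE></HEAD></HTML>'))
```

The price for the added tolerance is a little more code than the plain string search needs.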
Another, somewhat more convenient way of retrieving websites with Python is the mechanize module. mechanize is not part of the Python standard library. However, it is listed in the Python Package Index (PyPI), so the package can be installed with the usual tools such as pip or setuptools. Once mechanize has been installed successfully, it can be used in Python just like urllib.
mechanize is used in a similar way to urllib. It, too, must be imported first.
Then a browser object is instantiated.
A website can then be opened with the browser's open() method.
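The three steps (import, instantiate a browser, open a page) can be sketched as follows. The sketch assumes that the third-party package mechanize is installed; the helper name open_with_mechanize is an assumption made for illustration.

```python
def open_with_mechanize(url):
    """Open url in a new mechanize browser and return the browser object."""
    # Imported inside the function so that this file still loads on
    # systems where the third-party package mechanize is not installed.
    import mechanize

    browser = mechanize.Browser()  # behaves like a scriptable browser session
    browser.open(url)              # load the page
    return browser


# Example call (needs mechanize and network access):
# browser = open_with_mechanize("http://city-insider.de")
```

The browser object keeps the most recently loaded page, so later calls can work on it without downloading the page again.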
With mechanize, too, the website can be printed line by line on the console.
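One possible way to do this is sketched below. It assumes a mechanize.Browser object that has already opened a page and uses its response() method to read the page body; the helper names and the UTF-8 default are assumptions.

```python
def page_lines(browser, encoding="utf-8"):
    """Return the currently loaded page of a browser as a list of text lines."""
    data = browser.response().read()
    if isinstance(data, bytes):
        # Under Python 3 the body arrives as bytes and has to be decoded.
        data = data.decode(encoding, errors="replace")
    return data.splitlines()


def print_page(browser, encoding="utf-8"):
    """Print the currently loaded page line by line."""
    for line in page_lines(browser, encoding):
        print(line)


# Example call (needs mechanize and network access):
# import mechanize
# browser = mechanize.Browser()
# browser.open("http://city-insider.de")
# print_page(browser)
```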
The title of the website could now be read out in the same way as with urllib. However, mechanize already ships with a method of its own that does this for us, title().
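A sketch of the title lookup with mechanize, again assuming the package is installed. The function name get_title_mechanize is an assumption; Browser, open() and title() are part of mechanize.

```python
def get_title_mechanize(url):
    """Return the title of the page at url using mechanize."""
    # Imported inside the function so that this file still loads on
    # systems where the third-party package mechanize is not installed.
    import mechanize

    browser = mechanize.Browser()
    browser.open(url)
    # title() returns the page title as parsed by mechanize.
    return browser.title()


# Example call (needs mechanize and network access):
# print(get_title_mechanize("http://city-insider.de"))
```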
Both functions return the same result. The mechanize variant is probably much more robust, however, because it works regardless of how the title tag is spelled. It also means less effort for the developer, since the result does not have to be parsed laboriously by hand. In addition, mechanize can fill out and submit forms and follow links, so in theory entire click paths through a website can be scripted conveniently. I may write a separate article that covers these other functions of mechanize in more detail. This should be enough for a first overview.