Take down the paywall!

I like to read the local news in the morning to find out what’s going on in my hometown right now. Besides the public announcements of the local authorities and utilities, there is a lot of news that is difficult or impossible to find on the public websites of the respective party. Unfortunately, this information is often hidden behind a paywall tied to a subscription, that is impossible to overcome at first glance. I do not want to subscribe, because most of the news does not interest me or does not meet my quality standards. Thankfully, there is SEO and the associated problems for a content provider like the Madsack Media Group to be found on the Internet via search engines…

If you want to skip the long part: robert/madsack-plus

It’s easy if you know how it works

The following function accepts a single parameter Link, which is expected to be a URL string.


def GetMadsackArticleFromURL(Link):
    LinkIsValid = validators.url(Link)
    if LinkIsValid == True:
        MadsackArticle = requests.get(Link)
        if MadsackArticle.status_code == 200:
            MadsackArticleJSON = json.loads(re.findall('Fusion.globalContent=(.*?);Fusion.globalContentConfig', MadsackArticle.text)[0])
            return(MadsackArticleJSON)
        else:
            return { 'error' : True }
    else:
        return { 'error' : True }

The function first checks whether the provided URL is valid using the validators.url method. If the URL is valid, the function uses the requests library to make a GET request to the provided URL and store the response in the MadsackArticle variable.

The function then checks the status code of the response. If the status code is 200 (OK), it extracts the JSON data from the response using a regular expression and the json library, and returns it as the output of the function.

More precise, the regex is used to extract a substring of text between the two literals Fusion.globalContent= and ;Fusion.globalContentConfig from a larger string, which is the HTML content of a web page. The substring between these two literals is expected to be a JSON object, which will be loaded into a Python dictionary using json.loads().

If the provided URL is not valid or the request to the URL fails, the function returns a dictionary with an error key set to True.

This JSON string contains all the data needed to read the whole article.

This way we can comfortably avoid the paywall. As far as my legal understanding may judge, this method is completely legal.