Chrome Extension Messaging and Python HTML Parser App

OK, so I have looked at the Chrome documentation a little more, and I now have messaging working between the content, background and popup scripts, and understand how it works.

Basically, to send a message to content.js we use chrome.tabs.sendMessage(), and to send a message to the popup or background scripts we use chrome.runtime.sendMessage(). To listen for a message in any of these scripts we use chrome.runtime.onMessage.addListener() (the older chrome.extension.onMessage is deprecated).

Now I need some kind of JavaScript parser that will let me get the keywords on the page. It needs to scan only the text and ignore all the HTML.

After checking the web, I couldn't find a JavaScript library for parsing HTML that suited this. The only one I found I couldn't get to work properly, and its GitHub repo had not been updated in 4 to 5 years. I then looked at Python-based HTML parsers, and there are quite a few; Beautiful Soup is a popular one. I wrote a Python script using it that extracts all the text from a webpage, given a URL. Here is the script:

import urllib.request  # Python 3; in Python 2 this was urllib.urlopen
from bs4 import BeautifulSoup

def getText(url):
    # Fetch the raw HTML of the webpage at 'url'
    with urllib.request.urlopen(url) as f:
        html_doc = f.read()
    # Convert it into a BeautifulSoup object
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Remove all nodes enclosed in script or style tags;
    # otherwise we would be extracting the text inside these
    # nodes, whereas we only want the text on the webpage
    for elem in soup.find_all(['script', 'style']):
        elem.extract()
    # Extract all the text from the remaining element tags
    text = soup.get_text()
    return text
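
Since the end goal is the keywords on the page, here is a rough sketch of one way the extracted text could be turned into keywords: a plain word-frequency count using only the standard library. The function name top_keywords, the tiny stop-word list and the length cutoff are my own assumptions for illustration, not a settled approach.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption); a real one would be longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
              "it", "on", "for", "that", "this", "with"}


def top_keywords(text, n=10):
    """Return the n most frequent non-stop-words as (word, count) pairs."""
    # Lowercase the text and split it into runs of letters/apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Skip stop words and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)
```

It would be called on the output of the script above, for example top_keywords(getText(url), 20).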

I wrapped the script in a Flask server. The next step is to set this app up as a server with an API endpoint that the extension can query using Ajax: the extension sends a URL, and the Python app sends back that webpage's text.
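
A minimal sketch of what such an endpoint might look like, under some assumptions: the /text route, the url query parameter and the helper names text_response and create_app are all hypothetical, and the text extractor is passed in as an argument. Flask is imported inside create_app so the request-handling logic can be tried without it.

```python
from urllib.parse import urlparse


def text_response(url, get_text):
    """Build a (payload, status) pair; get_text is the text extractor."""
    parsed = urlparse(url or "")
    # Only fetch well-formed http(s) URLs; reject everything else.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return {"error": "expected an http(s) URL"}, 400
    try:
        return {"url": url, "text": get_text(url)}, 200
    except Exception as exc:  # e.g. network errors while fetching the page
        return {"error": str(exc)}, 502


def create_app(get_text):
    """Wrap text_response in a Flask app (Flask is imported lazily)."""
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/text")  # hypothetical endpoint name
    def text():
        payload, status = text_response(request.args.get("url"), get_text)
        response = jsonify(payload)
        # Assumption: allow cross-origin reads so the extension's Ajax
        # call can see the response; tighten this for real use.
        response.headers["Access-Control-Allow-Origin"] = "*"
        return response, status

    return app
```

With the script above, create_app(getText).run(port=5000) would start the server locally, and the extension could then request http://localhost:5000/text?url=... and read the text field of the JSON reply.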

 
