# Vör : A Knowledge Graph Project

Wikipedia articles on similar topics tend to share a set of similar keywords. Some of these keywords are literally the same, and others mean something very similar. Here I explore the keywords collected from thousands of Wikipedia articles and build a linked network from them.

## Involving a few toys

• MongoDB ~ Storing the crawled Wikipedia articles.
• OrientDB ~ Storing the keywords and links among them.
• RabbitMQ ~ Operating crawling queues.

## Wikipedia Crawler

The crawler is deliberately simple: it picks a seed page, collects the whole content as text, and stores it in MongoDB. It then continues with the articles linked from the seed page, and the process repeats until it is interrupted.
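The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fetch` and `store` are hypothetical callables standing in for the Wikipedia request and the MongoDB insert, an in-memory `deque` stands in for the RabbitMQ queue, and `max_pages` is an added stop condition (the real crawler simply runs until interrupted).

```python
from collections import deque


def crawl(seed, fetch, store, max_pages=100):
    """Breadth-first crawl starting from `seed`.

    fetch(title) -> (text, linked_titles)   # hypothetical: downloads one article
    store(title, text)                      # hypothetical: e.g. a MongoDB insert
    Returns the titles in the order they were visited.
    """
    queue = deque([seed])   # stand-in for the RabbitMQ crawling queue
    seen = {seed}           # titles already queued, to avoid revisiting
    visited = []

    while queue and len(visited) < max_pages:
        title = queue.popleft()
        text, links = fetch(title)
        store(title, text)
        visited.append(title)

        # Queue every linked article that has not been seen yet.
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)

    return visited
```

Passing the fetch and store steps in as functions keeps the loop itself independent of the network and the database, which makes it easy to try out on a small in-memory set of pages.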

## Keyword Extraction

I designed a very naïve keyword extractor. Its key steps are:
• Remove non-alphabetic symbols.
• Remove all stopwords.
• Remove words tagged with an unwanted part of speech, e.g. pronouns, adverbs, etc.
• Join the remaining keywords where possible, e.g. [Potassium, Nitrate] joins into a single keyword [Potassium Nitrate].
The extraction works well enough, but it doesn't perfectly filter out some words that a human would recognize as garbage.
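A rough sketch of these steps is shown below, under several assumptions. The `STOPWORDS` set is a tiny hypothetical sample, the part-of-speech filter is left out because it needs a tagger, and "joining where possible" is interpreted as merging consecutive words that survive the filters. The actual extractor may differ on all three points.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "with", "it", "as"}


def extract_keywords(text, stopwords=STOPWORDS):
    """Return keyword phrases found in `text`, in order of appearance."""
    # Step 1: replace every non-alphabetic character with a space, then tokenize.
    tokens = re.sub(r"[^A-Za-z]+", " ", text).split()

    keywords, run = [], []
    for token in tokens:
        if token.lower() in stopwords:
            # Step 2: drop the stopword; it also ends the current run of words.
            if run:
                keywords.append(" ".join(run))
                run = []
        else:
            # Step 3 (assumed rule): adjacent surviving words form one keyword.
            run.append(token)
    if run:
        keywords.append(" ".join(run))
    return keywords
```

For example, `extract_keywords("Potassium nitrate is a salt.")` yields the phrase "Potassium nitrate" followed by "salt". Because punctuation is flattened to spaces first, a run can also span a sentence boundary, which is one way garbage keywords slip through.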

## Future Work

I'm now working on topic clustering based on keywords. We'll see how well it works soon.

## WIP : Using Word Vectors (Word2Vec)

Word2Vec is a word-embedding model for text analysis that maps words to vectors, so that words used in similar contexts can be identified as similar. In this project, Word2Vec steps in to join similar keywords in the knowledge model. The idea is simple: if Wikipedia pages cover closely related topics but refer to similar terms with different keywords, we should be able to identify those terms.
Connecting related keywords this way gives topics that don't share the same keywords, but are semantically equivalent, more connections between them.
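One way to picture the linking step is with cosine similarity over keyword vectors, sketched below. It assumes the vectors have already been produced by a trained Word2Vec model; the `vectors` mapping, the `similar_pairs` helper, and the `0.8` threshold are all hypothetical choices for illustration, not part of the project.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_pairs(vectors, threshold=0.8):
    """Return (keyword_a, keyword_b, score) for every pair at or above `threshold`.

    `vectors` maps each keyword to its embedding (assumed to come from Word2Vec).
    """
    links = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            links.append((a, b, score))
    return links


# Toy 3-dimensional vectors, made up for the example.
vectors = {
    "saltpeter": [0.9, 0.1, 0.0],
    "potassium nitrate": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}
links = similar_pairs(vectors)  # only the saltpeter / potassium nitrate pair passes
```

Each returned pair could then be stored as an extra edge between the two keywords in the graph, so that articles mentioning either term become connected.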