Markov Algorithm - The Theory

So, as I said, the next few articles I write will be about content generation.

To be precise, we are not going to generate content from scratch, because 100% generated content is never good enough. Instead, what we are going to do is find existing content, read it, and manipulate it so much that nobody will ever see that the content is stolen.

To start off, we’ll see how we can take an existing article and, using automatic algorithms, turn it into a readable and yet totally different article.
To tell the truth, the article will not be 100% readable, and there’s a good chance that parts of it will not make sense. But one thing I can tell you for sure - Google’s bots will never know it’s not original, authentic content, and even Google’s moderators, who will usually read your article hastily, will probably not suspect anything.

Today’s article will deal with the theory behind the algorithm. The next article will show how to implement the algorithm in your code.

The idea of Markov’s algorithm is based on Markov chains, so first let’s see what Wikipedia has to say about Markov chains:
In mathematics, a Markov chain, named after Andrey Markov, is a discrete-time stochastic process with the Markov property. Having the Markov property means that, given the present state, future states are independent of the past states. In other words, the present state description fully captures all the information that could influence the future evolution of the process, but for which future states will be reached through a probabilistic process instead of a deterministic one. Thus, given the present, the future is conditionally independent of the past.

At each time instant the system may change its state from the current state to another state, or remain in the same state, according to a certain probability distribution. The changes of state are called transitions, and the probabilities associated with various state-changes are termed transition probabilities.

There’s a good chance that if you never studied probability theory, the definition will look intimidating, but actually, the idea is quite simple.
What we have is a process, which is always in a certain state.
The process switches from one state to another according to a table of probabilities.
There’s a certain probability to switch from each state A to each state B, with two requirements:
a) The sum of the probabilities to switch from any state A to all states (including staying in A) should be 100% (so that after each state there will definitely be another state).
b) The probability to move from one state to another depends only on the current state, and not on any previous states. This makes the process simpler to calculate.
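
To make the two requirements concrete, here is a minimal Python sketch of such a process. The two weather states and their probabilities are made up for illustration, and the function names are my own, so treat it as one possible way to model the table rather than a fixed recipe.

```python
import random

# Hypothetical two-state chain; states and numbers are invented for illustration.
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def check_rows(transitions, tolerance=1e-9):
    """Requirement (a): the probabilities leaving each state sum to 1 (100%)."""
    return all(
        abs(sum(row.values()) - 1.0) <= tolerance
        for row in transitions.values()
    )

def next_state(transitions, current, rng=random):
    """Requirement (b): the choice looks only at the current state."""
    row = transitions[current]
    return rng.choices(list(row), weights=list(row.values()), k=1)[0]

def walk(transitions, start, steps, rng=random):
    """Run the process for a number of steps and return the visited states."""
    state = start
    path = [state]
    for _ in range(steps):
        state = next_state(transitions, state, rng)
        path.append(state)
    return path

print(check_rows(TRANSITIONS))
print(walk(TRANSITIONS, "sunny", 5))
```

Notice that next_state never receives the history of the walk, only the current state. That is the whole Markov property in one function signature.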

Even if you don’t fully understand how this works, it doesn’t matter too much. What you have to understand is that, given the current state, we can easily look up the probabilities for the next state.

Now let’s see what all this has to do with Content Generation.

First, we’ll introduce a new variable: Granularity, or in short G.
G equals the number of words used to determine the next word. In all my examples, I’ll use a granularity of 2, because in my experience this is the granularity that produces the best texts.

Now let’s build a probability table, in which the keys are every sequence of G consecutive words that appears in the text, and for each key we’ll write down all the words that appear right after it, together with the number of times each one appeared.
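
As a rough illustration of what the keys look like, the sketch below slides a window of G words over a made-up sentence and pairs each window with the word that follows it. The helper name key_and_next_pairs and the sample sentence are my own assumptions, not part of any library.

```python
def key_and_next_pairs(words, g):
    """Yield (key, next_word) pairs, where key is a tuple of G consecutive words."""
    for i in range(len(words) - g):
        yield tuple(words[i:i + g]), words[i + g]

# Made-up sample sentence, already split into lowercase words.
words = "the cat sat on the mat".split()

pairs_g1 = list(key_and_next_pairs(words, 1))
pairs_g2 = list(key_and_next_pairs(words, 2))

for key, following in pairs_g2:
    print(" ".join(key), "->", following)
```

With G=1 the key ('the',) has two different followers, while with G=2 every key in this tiny sentence has exactly one. A larger G means fewer choices at each step, so the output stays closer to the source.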

For example, let’s take the simple text: “My name is beautiful, my name is fair, my name will never be forgotten”
For the phrase “my name” (ignoring upper case and punctuation) we’ll have two options: ‘is’, which appeared 2 times, and ‘will’, which appeared one time.
This means that after the phrase “my name” there’s a probability of 2/3 that ‘is’ will appear, and 1/3 that ‘will’ will appear.
After the phrase “name is” there’s a probability of 1/2 that the word ‘beautiful’ will appear, and 1/2 that the word ‘fair’ will appear.

For phrases that appear only once, things are even easier. For example, after the phrase “will never”, the only possible word is ‘be’.
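
The numbers above can be checked with a short Python sketch. It assumes the text is lowercased and stripped of punctuation before counting (otherwise “My name” and “my name” would be different keys), and the function names are just my own choice for this illustration.

```python
import re
from collections import Counter, defaultdict
from fractions import Fraction

def tokenize(text):
    """Assumption: lowercase everything and drop punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_table(words, g=2):
    """Map every G-word key to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - g):
        table[tuple(words[i:i + g])][words[i + g]] += 1
    return table

def probabilities(counter):
    """Turn raw follower counts into exact fractions."""
    total = sum(counter.values())
    return {word: Fraction(count, total) for word, count in counter.items()}

TEXT = "My name is beautiful, my name is fair, my name will never be forgotten"
table = build_table(tokenize(TEXT), g=2)

# 'is' gets 2/3 and 'will' gets 1/3, as in the example above.
for word, p in probabilities(table[("my", "name")]).items():
    print(word, p)
```

Storing the counts rather than the probabilities is usually enough, since a weighted random choice can work directly from the counts.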

So, now let’s see how the process works: we take the first G words of our source text (that’s the initial state of the process), find them in the probability table, and randomly select the next word according to the probabilities.
Now we have G+1 words in the new, re-written text. We look at the last G words, look them up in the probability table, and again randomly select the next word according to it.
We continue this way until the text is long enough.

Each time, the state of the Markov Chain is the last G words in our text, and the next state is determined by the probability table we have just constructed.
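
The loop described above could look roughly like the following Python sketch. It is only an outline under my own assumptions: text is lowercased with punctuation removed, the function names are invented for this example, and generation simply stops if the last G words are not in the table.

```python
import random
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Assumption: lowercase everything and drop punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_table(words, g=2):
    """Map every G-word key to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - g):
        table[tuple(words[i:i + g])][words[i + g]] += 1
    return table

def generate(words, g=2, max_words=50, rng=random):
    """Start from the first G source words and keep appending random followers."""
    table = build_table(words, g)
    output = list(words[:g])
    while len(output) < max_words:
        # The current state is the last G words of the text written so far.
        followers = table.get(tuple(output[-g:]))
        if not followers:
            break  # no known follower: stop (one of several possible choices)
        choice = rng.choices(list(followers), weights=list(followers.values()), k=1)
        output.append(choice[0])
    return output

TEXT = "My name is beautiful, my name is fair, my name will never be forgotten"
print(" ".join(generate(tokenize(TEXT), g=2, max_words=20)))
```

Because every step only follows transitions that exist in the table, any G+1 consecutive words in the output also appear somewhere in the source. That is what keeps the result locally readable.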

In case we reach a phrase that does not appear in the probability table (this usually happens when the last phrase in the source text appears only once), there are two options: either start from the beginning again, or stop here.
The biggest problem with the first approach is that with short source texts, we might get a long phrase that repeats itself.
However, with longer texts this problem almost never occurs, while the second approach might return a very short text once in a while.
Therefore, I prefer the first approach, but you may choose whatever suits you.
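
Here is a hedged sketch of both ways to handle such a dead end, selected by a parameter. The on_dead_end parameter, its values "restart" and "stop", and the function names are all my own inventions for this illustration. "Restart" is interpreted here as appending the first G words of the source again and continuing from there.

```python
import random
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Assumption: lowercase everything and drop punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def build_table(words, g=2):
    """Map every G-word key to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(words) - g):
        table[tuple(words[i:i + g])][words[i + g]] += 1
    return table

def generate(words, g=2, max_words=30, on_dead_end="restart", rng=random):
    """Generate text; on a dead end either restart from the opening or stop."""
    table = build_table(words, g)
    start = list(words[:g])
    if not table:
        return start  # source too short to contain any transition
    output = list(start)
    while len(output) < max_words:
        followers = table.get(tuple(output[-g:]))
        if followers:
            choice = rng.choices(list(followers), weights=list(followers.values()), k=1)
            output.append(choice[0])
        elif on_dead_end == "restart":
            output.extend(start)  # jump back to the opening state
        else:
            break  # "stop": return whatever we have so far
    return output[:max_words]

TEXT = "My name is beautiful, my name is fair, my name will never be forgotten"
words = tokenize(TEXT)
print(" ".join(generate(words, on_dead_end="restart")))
print(" ".join(generate(words, on_dead_end="stop")))
```

With a source this short, the restarting variant shows the repetition problem very clearly, while the stopping variant can end after only a handful of words.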

That’s all for today. In case you understood nothing, don’t worry: in my next post I’ll show you how to implement all of this in your code, and that might make a few things clearer.
Stay with me, and you too will never again have to buy content generation software, because you’ll be able to write it yourself.
Don’t forget to subscribe to the RSS feed, so that you don’t miss the next article.
