Compare the contents of each div or table tag against the search query used to locate the site. If you find a match, you have found the div tag that should contain the content. Also remember to strip things like JavaScript, to keep Google ads and other unrelated content out of your article.
Not a bulletproof solution but the best I could think of in a minute.
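For anyone who wants to try this, here is a rough sketch of the idea in Python, using only the standard library. The names `BlockCollector` and `best_block` are made up for illustration, and the scoring (counting how often the query words appear in a block) is just one assumption about what "a match" could mean. Text is credited to the innermost open div or table, and anything inside script or style tags is ignored.

```python
import re
from html.parser import HTMLParser


class BlockCollector(HTMLParser):
    """Collects the text of each div/table, skipping script and style content."""

    CONTAINERS = {"div", "table"}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of text fragments, one list per open div/table
        self.blocks = []       # finished blocks as plain strings
        self.skip_depth = 0    # > 0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.CONTAINERS:
            self.open_blocks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.CONTAINERS and self.open_blocks:
            fragments = self.open_blocks.pop()
            # Collapse runs of whitespace into single spaces.
            self.blocks.append(" ".join(" ".join(fragments).split()))

    def handle_data(self, data):
        # Credit text to the innermost open container only.
        if self.skip_depth == 0 and self.open_blocks:
            self.open_blocks[-1].append(data)


def best_block(html, query):
    """Return the div/table text with the most query-word hits, or None."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    terms = query.lower().split()

    def score(text):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return sum(words.count(term) for term in terms)

    best = max(parser.blocks, key=score, default="")
    return best if score(best) > 0 else None
```

Real pages have badly nested tags, so the stack handling here will misattribute text now and then. It is meant as a starting point, not a finished extractor.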
Thanks for the advice, heiska!
What I came up with yesterday:
1. Just allow a-z A-Z 0-9 , ! . ? -
If there's any other character in it, it's not a sentence! This will filter out some correct sentences, but it works quite well...
2. Check for the length and the number of spaces in it.
3. Only grab content between <p> HTML tags!
The results are pretty good now...
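The three steps above could look roughly like this in Python. This is only a sketch: the length and space limits (`min_len`, `max_len`, `min_spaces`) are made-up values since no numbers are given, the names `looks_like_sentence`, `ParagraphCollector` and `extract_sentences` are hypothetical, and it assumes spaces are allowed in addition to the listed characters. The check runs on the whole text of each p tag.

```python
import re
from html.parser import HTMLParser

# Step 1: the allowed characters, plus the space (an assumption).
ALLOWED = re.compile(r"[a-zA-Z0-9,!.?\- ]+")


def looks_like_sentence(text, min_len=20, max_len=300, min_spaces=3):
    """Apply the character whitelist, then the length and space checks."""
    if not ALLOWED.fullmatch(text):
        return False
    # Step 2: thresholds are illustrative guesses, tune them on real data.
    if not min_len <= len(text) <= max_len:
        return False
    return text.count(" ") >= min_spaces


class ParagraphCollector(HTMLParser):
    """Step 3: keep only the text found between <p> and </p>."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.parts = []
        self.paragraphs = []

    def flush(self):
        # Store the current paragraph (if any) with whitespace collapsed.
        if self.in_p:
            text = " ".join("".join(self.parts).split())
            if text:
                self.paragraphs.append(text)
        self.parts = []
        self.in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # a new <p> implicitly ends an unclosed one
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self.in_p:
            self.parts.append(data)


def extract_sentences(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser.flush()  # pick up a trailing paragraph that was never closed
    return [p for p in parser.paragraphs if looks_like_sentence(p)]
```

As noted in step 1, the whitelist throws away valid sentences too, for example anything with an apostrophe, a colon or an accented letter.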