There is some code for text extraction from HTML documents, along with one or two utilities, in Sanchay <http://sanchay.co.in>, but it is undocumented and not connected to the current public GUI. The available code will have to be modified slightly for each specific format: a little code that uses the HTML parser library to effectively create an extraction template for that format. For a single format, this is not very time-consuming.
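
To make the "template per format" idea concrete, here is a minimal sketch in Python using only the standard-library html.parser module. It is not the Sanchay code (which is not shown here); the class name, the extract() helper and the default "article-body" class are all hypothetical, and the (tag, class) pair would have to be chosen by inspecting the pages of each site.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TemplateExtractor(HTMLParser):
    """Collect the <title> text and the text inside one container element.

    The "template" is just the (tag, class) pair that marks the article body
    for one site format. Both values are assumptions, not part of any library.
    """

    def __init__(self, body_tag, body_class):
        super().__init__(convert_charrefs=True)
        self.body_tag = body_tag
        self.body_class = body_class
        self.in_title = False
        self.depth = 0  # greater than 0 while inside the body container
        self.title_parts = []
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1  # nested element inside the container
        elif tag == self.body_tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.body_class in classes:
                self.depth = 1  # entered the container

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title_parts.append(text)
        elif self.depth > 0:
            self.body_parts.append(text)


def extract(html, body_tag="div", body_class="article-body"):
    """Return (title, body_text) for one page, given that site's template."""
    parser = TemplateExtractor(body_tag, body_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), "\n".join(parser.body_parts)
```

The depth counter assumes reasonably well-formed markup; pages that omit closing tags (an unclosed <p>, for instance) would need a more forgiving parser.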
On Mon, Aug 2, 2010 at 4:54 PM, Siddhartha Jonnalagadda <sid.kgp at gmail.com> wrote:
> Thanks all for your replies. I am trying BoilerPipe now; will also look
> into the other things mentioned.
>
> thanks again,
> siddhartha
>
> On Mon, Aug 2, 2010 at 2:51 AM, Wouter Weerkamp <w.weerkamp at uva.nl> wrote:
>
>> In 2007 there was a workshop on content extraction from web pages. You
>> could have a look at the papers presented there:
>> http://cleaneval.sigwac.org.uk/
>>
>> If you intend to follow feeds, and need to extract content from these, you
>> can use a learning approach. For each feed you collect a certain number of
>> pages, and you learn which part of the page changes, and which parts don't.
>> From that it shouldn't be hard to determine "real" content.
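
A rough sketch of that idea in Python, assuming each page has already been split into text blocks (for example, one string per paragraph or menu item). The function names and the 0.5 threshold are illustrative choices, not taken from any existing tool.

```python
from collections import Counter


def learn_boilerplate(pages, min_share=0.5):
    """Return the blocks that occur in at least `min_share` of the sample pages.

    `pages` is a list of pages from one feed, each page a list of text blocks.
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block at most once per page
    threshold = min_share * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


def content_blocks(blocks, boilerplate):
    """Keep only the blocks of one page that were not learned as boilerplate."""
    return [block for block in blocks if block not in boilerplate]
```

Blocks that recur across most pages of a feed (navigation, copyright lines) end up in the boilerplate set, while whatever differs from page to page is kept as content.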
>>
>> You could also have a look at fivefilters; it works pretty well given the
>> simple approach it uses:
>> http://fivefilters.org/content-only/
>> (following a few links, you can get to the PHP code).
>>
>> Wouter
>>
>>
>>
>> On 8/1/10 8:08 PM, Beatrice Alex wrote:
>>
>>> You might want to check out Boilerpipe:
>>>
>>> http://code.google.com/p/boilerpipe/
>>>
>>> Best,
>>>
>>> Bea
>>>
>>> ------------------
>>> Beatrice Alex
>>> Research Fellow and Project Manager at the School of Informatics,
>>> University of Edinburgh.
>>>
>>>
>>> On 1 Aug 2010, at 01:26, Siddhartha Jonnalagadda wrote:
>>>
>>>> Is it trivial to extract the title and relevant text (ignoring the ads
>>>> and other irrelevant stuff)? For example, in the website:
>>>> http://tvnz.co.nz/world-news/chelsea-clinton-marries-in-ny-3680168
>>>>
>>>> I am only interested in extracting the title: "Chelsea Clinton marries in
>>>> NY"
>>>> and the subject below. How easy is this?
>>>>
>>>> "Bill and Hillary Clinton's daughter married her long-time boyfriend in
>>>> the picturesque New York village of Rhinebeck today in what has been dubbed
>>>> America's royal wedding.
>>>> Chelsea Clinton - the only child of the former US president and the US
>>>> secretary of state - wed Marc Mezvinsky at Astor Courts, an historic 50-acre
>>>> (20-hectare) estate on the Hudson River, about 160 km north of New York
>>>> City.
>>>>
>>>> "Today, we watched with great pride and overwhelming emotion as Chelsea
>>>> and Marc wed in a beautiful ceremony at Astor Courts, surrounded by family
>>>> and their close friends," Bill and Hillary Clinton said in a statement.
>>>>
>>>> "We could not have asked for a more perfect day to celebrate the
>>>> beginning of their life together, and we are so happy to welcome Marc into
>>>> our family," the statement said.
>>>>
>>>> "On behalf of the newlyweds, we want to give special thanks to the
>>>> people of Rhinebeck for welcoming us and to everyone for their well-wishes
>>>> on this special day."
>>>>
>>>> The statement, sent just after 7:30 pm (12:30pm NZT today), did not
>>>> indicate exactly when the nuptials took place.
>>>>
>>>> On Friday night, Bill and Hillary Clinton waved to crowds of onlookers
>>>> as they arrived at the historic Beekman Arms Inn in the center of Rhinebeck
>>>> for a late-night cocktail party for some of the wedding guests.
>>>>
>>>>
>>>>
>>>> Apart from the parents of the bride, the only other high profile guests
>>>> seen in Rhinebeck have been Bill Clinton's former secretary of state,
>>>> Madeleine Albright, actors Ted Danson and Mary Steenburgen and fashion
>>>> designer Vera Wang.
>>>>
>>>> Also spotted was real estate scion and movie producer billionaire Steve
>>>> Bing. Bing lent Bill Clinton his jet to fly to North Korea in August of last
>>>> year to bring home American journalists Laura Ling and Euna Lee after they
>>>> spent four months imprisoned in the reclusive communist state.
>>>>
>>>> Guests boarded buses in Rhinebeck to be taken to Astor Courts about 5 pm
>>>> EDT (10am NZT)
>>>>
>>>> Chelsea Clinton, 30, and Mezvinsky, 32, have known each other since they
>>>> were teenagers. He is an investment banker, whose parents Marjorie
>>>> Margolies-Mezvinsky and Edward Mezvinsky were once Democratic US House of
>>>> Representatives members.
>>>>
>>>> Chelsea Clinton, who worked at a New York hedge fund and has more
>>>> recently studied health policy at Columbia University, has kept a low
>>>> profile since her father left the White House in January 2001, although she
>>>> campaigned for her mother during her failed run for the 2008 Democratic
>>>> presidential nomination.
>>>>
>>>> Signs and pictures congratulating the newlyweds hang in many shop
>>>> windows in Rhinebeck, which has been swarmed by media around the world for
>>>> an event that experts estimate to have cost between $US3 million and $US5
>>>> million.
>>>>
>>>> Airspace above Rhinebeck has been closed for 12 hours from 3 pm EDT (7pm
>>>> NZT) today for the wedding and media were kept well away from the entrance
>>>> to Astor Courts. Security in the area was comparable to that surrounding
>>>> state visits.
>>>>
>>>> The guest list was reported to be between 400 and 500, but did not
>>>> include a very understanding President Barack Obama.
>>>>
>>>> "Hillary and Bill properly want to keep this as a thing for Chelsea and
>>>> her soon-to-be husband," Obama said on The View talk show on Thursday. "It
>>>> would be tough enough to have one president at a wedding. You don't want two
>>>> presidents."
>>>>
>>>> "
>>>> _______________________________________________
>>>> Corpora mailing list
>>>> Corpora at uib.no
>>>> http://mailman.uib.no/listinfo/corpora
>>>>
>>>
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>>
>>
>> --
>> ISLA * University of Amsterdam * http://ilps.science.uva.nl
>>
>