LibThai as PHP Extension
Posted August 11th, 2007 by vorapoap
Is anybody interested in doing this?
At first I thought it would be easy to learn how to create a PHP extension.
I tried to do it, but my head is too full of other projects.
Maybe I haven't found a good tutorial, or maybe it is just not that easy.
Is anyone willing to do this?
It would be great for many Thai web developers.
Thank you
Re: LibThai as PHP Extension
If it's not serious work and doesn't need high performance,
calling an external shell command through PHP's exec() may help.
Re: LibThai as PHP Extension
I don't think PHP's exec() is a suitable tool for a site with high traffic.
What I am trying to do is segment every forum post into Thai words and store those words in the database.
(This is the same thing phpBB2 does for its search function.)
I think this would improve Thai search in the forum dramatically.
Re: LibThai as PHP Extension
At last I have succeeded in creating a PHP extension for the libthai and libwordcut libraries.
If you are interested, please reply in this forum.
I will make it public as soon as I think it is ready.
The extension will be used in my forum on khum.net for Thai FullText search with MySQL.
Thai FullText search? Not impossible? That's right, it is possible.
I am testing it right now...
Re: LibThai as PHP Extension
Thanks for your work. Some people have been asking me about this kind of thing for some time. It's coming true at last!
How would you distribute it? Via php-pear? Or do you need a facility such as CVS or FTP? Just let us know if you do. You can even create a story to announce it on the LTN front page when it's finished.
Re: LibThai as PHP Extension
It would be on PECL. But before that...
I am looking at one downside of libthai compared to libwordcut.
libwordcut allows you to pre-load the dictionary into memory and re-use it whenever you want.
With this feature, I am able to load the dictionary during Apache/PHP initialization and free it when Apache shuts down.
It seems that libthai doesn't allow this, so I may have to modify your libthai source to take advantage of pre-loading the dictionary.
Or do you have any other comment?
PS. I think the Captcha below is very difficult to get past :( ...
Re: LibThai as PHP Extension
In this respect, libthai follows a lazy-initialization scheme. The dictionary is loaded on the first call to th_brk() (that is, it's never loaded if you are not using the line breaker feature), and stays in memory until the library is unloaded. [Ref: brk_get_dict() in src/thbrk/brk-maximal.c] I'm not sure if this is close to what you want.
Sorry for the captcha. We need it to keep out spammers who sign up solely to post ads. (They did!) Besides, you are helping a collaborative OCR project to digitize books for the Internet Archive every time you answer the captcha!
Tip: You can request a new captcha by clicking on the renew button (above the speaker icon) whenever you are really in trouble with a given captcha.
Re: LibThai as PHP Extension
From what you mentioned, may I assume that if I call brk_get_dict() at Apache/PHP initialization, the dictionary data will stay there and be reused by any request to the web server, and that the dictionary is automatically freed when Apache terminates?
I have to apologize for my knowledge of C programming; it may not be good enough to find exactly how you free your dictionary in that source code. In your source code, brk_root_pool is called in brk_maximal_do and brk_pool_free is called at the end of brk_maximal_do? Doesn't that mean you allocate memory on each call to brk_maximal_do? Please correct me if I am wrong.
Hmm, I am not sure about this. When writing a PHP extension, I can declare a global variable and allocate memory for it (only once), and it can be reused at any time during the Apache/PHP session without needing to be reallocated.
About the captcha, it is alright... :P
Re: LibThai as PHP Extension
You don't need to explicitly call brk_get_dict() yourself. It's called internally from brk_root_pool(), which is called by brk_maximal_do(). And you can see in the brk_get_dict() definition that the dictionary is only loaded once, on the first call. On subsequent calls, the loading is skipped, as the static variable is no longer NULL.
This whole mechanism is hidden from libthai clients. As I said, it's done automatically as lazy initialization. And no, it doesn't allocate memory on every call. Just the first.
But you are right that the dictionary is not explicitly unloaded. I just assume it's unloaded with the library's data segment when the process terminates or the shared object is unloaded. It may not be so elegant, though. But I do plan to do it more explicitly some time.
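For readers following the discussion, the lazy-initialization pattern described above can be sketched roughly as below. This is only an illustration of the general technique, not libthai's actual code: `Dict`, `load_dict()`, `get_dict()` and `load_count` are hypothetical names made up for the sketch, and the loader just allocates a placeholder instead of reading a dictionary file.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a loaded dictionary; not libthai's real type. */
typedef struct {
    int n_words;
} Dict;

/* Counts how often the expensive load actually runs (illustration only). */
static int load_count = 0;

/* Hypothetical loader; a real one would read the dictionary file here. */
static Dict *load_dict(void)
{
    Dict *d = malloc(sizeof *d);
    if (d != NULL) {
        d->n_words = 0;
        load_count++;
    }
    return d;
}

/* Returns the shared dictionary, loading it only on the first call.
 * If loading fails, the pointer stays NULL and the next call retries. */
Dict *get_dict(void)
{
    static Dict *dict = NULL;   /* keeps its value between calls */

    if (dict == NULL)
        dict = load_dict();     /* skipped once dict has been set */
    return dict;
}
```

Calling `get_dict()` twice returns the same pointer and leaves `load_count` at 1. As in the post above, nothing in this sketch frees the dictionary explicitly; the memory is simply reclaimed when the process exits. In a multi-process web server, each process that reaches this code would do its own first-call load.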
Re: LibThai as PHP Extension
I have succeeded in building Thai FULLTEXT search with MySQL.
Thanks for your library.
I will release this extension soon, but don't you think I should expose the other functions as well?
I will have to find some free time to do so... :( Or may I release the extension with just a single callable function?
---
Moreover, I found a small problem: how can I improve the breaking of words that don't exist in the dictionary, like 'ดาจิม'?
The operation results in 'ดา จิ ม'.
I think that since 'ม' cannot stand on its own (it is only 1 character long), it should be re-attached to 'จิ'...
???
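One naive way to try the idea in the post above is a post-processing pass over the segments returned by the word breaker. The sketch below assumes the segments are already available as separate UTF-8 strings; `utf8_length()` and `join_merging_singles()` are hypothetical helper names, not part of libthai or libwordcut. It applies exactly the rule suggested above (a one-character segment is glued onto the segment before it), which is a blunt heuristic: any genuine one-character word would be glued too.

```c
#include <stddef.h>
#include <string.h>

/* Count Unicode code points in a UTF-8 string by skipping
 * continuation bytes (those of the form 10xxxxxx). */
static size_t utf8_length(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}

/* Join segments with single spaces, except that a segment of exactly
 * one code point is appended directly to the previous segment.
 * Returns 0 on success, -1 if the output buffer is too small. */
int join_merging_singles(const char *const *segs, size_t count,
                         char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(segs[i]);
        int glue = (i > 0 && utf8_length(segs[i]) == 1);
        size_t sep = (i > 0 && !glue) ? 1 : 0;

        if (used + sep + len + 1 > out_size)
            return -1;              /* not enough room, give up */
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, segs[i], len);
        used += len;
        out[used] = '\0';
    }
    return 0;
}
```

With the segments 'ดา', 'จิ', 'ม' from the example, this produces 'ดา จิม'. Whether that is the right behaviour for every unknown word is a separate question.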
Re: LibThai as PHP Extension
Probably someone was already able to do this a long time ago... well.
No one really wants it, I guess.