Ben Humphreys

Fun LOLCats linguistic analysis from Superlinguo

    • #linguistics
    • #video
  • 2 months ago

Unsupervised Syntactic Alignment with Inversion Transduction Grammars

I’m presenting this tomorrow in our weekly lab MT paper-introduction session. It’s my first time presenting and I’m still finding my feet in the field so it’s a little nerve-wracking.

    • #machine translation
    • #linguistics
  • 8 months ago

200 terabytes of wow.

    • #video
    • #linguistics
  • 10 months ago

Constructed Language Algorithm

I swear I have some of my weirder ideas in the shower. It’d be interesting to create a program that generated realistic natural language. Maybe it’d be useful to annoy linguists and make more Voynich manuscripts.

Off the top of my head, some of the things you’d have to randomly generate would be:

  • Word order - the relative order of subject, object and verb: SOV, SVO, VSO, etc.
  • Phonology - pick the sounds that are acceptable in the language.
  • Noun declension - make up some case endings and attach them to match your word order. Could even give nouns 4+ different ‘genders’ for maximum annoyance and declension.
  • Verb conjugation - pick some likely verb conjugation for tenses. Suffixes, infixes, prefixes.
  • Word length - Longer for agglutinative languages. Maybe don’t have any spaces at all.
  • Writing direction - Pretty sure there’s only 2 options for this.
  • Alphabet - If you’re feeling adventurous you could generate some characters that look like they could be drawn. Oh, or go for a Mayan-ish script.

Mayan glyphs
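
Here is a minimal Python sketch of the idea. It covers only word order, phonology, case endings and a single tense suffix. The class name ToyLanguage, the sound inventories and the consonant + vowel syllable shape are all assumptions made up for illustration, not anything linguistically principled.

```python
import random

# Illustrative inventories only (assumptions, not a linguistic standard).
WORD_ORDERS = ["SOV", "SVO", "VSO", "VOS", "OVS", "OSV"]
CONSONANTS = list("ptkbdgmnslrwh")
VOWELS = list("aeiou")


class ToyLanguage:
    """A hypothetical randomly generated language with a handful of parameters."""

    def __init__(self, seed):
        # A seeded generator means the same seed always yields the same language.
        self.rng = random.Random(seed)
        self.word_order = self.rng.choice(WORD_ORDERS)
        # Phonology: keep only a subset of the available sounds.
        self.consonants = self.rng.sample(CONSONANTS, 7)
        self.vowels = self.rng.sample(VOWELS, 3)
        # Morphology: one made-up case ending per role, one past-tense suffix.
        self.case_suffix = {"S": self.syllable(), "O": self.syllable()}
        self.past_suffix = self.syllable()

    def syllable(self):
        # Every syllable is consonant + vowel, drawn from the chosen sounds.
        return self.rng.choice(self.consonants) + self.rng.choice(self.vowels)

    def word(self, syllables=2):
        return "".join(self.syllable() for _ in range(syllables))

    def sentence(self):
        # Build a subject, an object and a verb, then arrange them by word order.
        parts = {
            "S": self.word() + self.case_suffix["S"],
            "O": self.word() + self.case_suffix["O"],
            "V": self.word() + self.past_suffix,
        }
        return " ".join(parts[role] for role in self.word_order)


lang = ToyLanguage(seed=42)
print(lang.word_order, "->", lang.sentence())
```

Writing direction and the alphabet are left out, since those are rendering problems rather than text generation.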

I still can’t think of a possible use for this.

    • #linguistics
    • #programming
  • 1 year ago

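English Mistakes I’ve Often Seen Americans Make

This is half a joke, but I just saw a correction written by an American and it annoyed me.

It’s about when to use good and when to use well.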

Good (adjective) = 良い

It was a good day.
Tim wrote a good report.

Well (adverb) = よく

Tim wrote his report well.
He plays football well.

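In Japanese, よい and よく are clearly distinguished, and I have never once seen a Japanese person mix up good and well.

Americans, however, often say sentences like these. I’ve heard them many times in films.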

*You did good today.
(You did well today.)

*He fried that chicken good.
(He fried that chicken well.)

In casual conversation I can just about forgive it, but it irritates me when I see it in proper writing.

That’s all (lol)

(This post is the version I had corrected on Lang-8.)

    • #english
    • #linguistics
    • #grammar
    • #japanese
  • 1 year ago

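English Mistakes I’ve Often Seen Japanese People Make (Part 1)

(To practise my Japanese I posted this essay on Lang-8 and had it corrected. Below is the corrected version.)

This isn’t simply a criticism of Japanese people. I’ve corrected a lot of my Japanese friends’ English, and I noticed that a surprising number of the mistakes are similar.

I mark incorrect English sentences with *.

1) Wrong spacing next to punctuation

This is simple, but it doesn’t seem to be taught properly at school, so lots of people get it wrong. If you take it lightly and think “it’s only the internet, so it doesn’t matter”, you risk making the same mistake when you write an important document, so be careful. Of today’s three points this is the easiest to fix, so learn it.

The rules are:

  • No space before a comma or full stop.
  • One space after a comma or full stop.

Always watch out for this. The following are wrong and hard to read.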

*”How are you ?I’m fine”

*”We went shopping , had lunch and came home”

In addition, a hyphen used as a dash needs a space before and after it. : and ; follow the same rules as commas and full stops.

“I wanted to go home - but I couldn’t.”

“Star Wars: The Empire Strikes Back”

Learn this and write clean, easy-to-read sentences.
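
The two spacing rules are mechanical enough to sketch in Python with regular expressions. The function name fix_punctuation_spacing is made up for this example, and the approach is deliberately naive: it would also put a space inside decimals like 3.5 or addresses like example.com, so treat it as an illustration of the rules rather than a real proofreading tool.

```python
import re

# Punctuation marks covered by the two rules: , . ? ! : ;
MARKS = r"[,.?!:;]"


def fix_punctuation_spacing(text):
    """Apply the two spacing rules: no space before a mark, one space after."""
    # Rule 1: delete any spaces that come directly before a mark.
    text = re.sub(r" +(" + MARKS + r")", r"\1", text)
    # Rule 2: put exactly one space after a mark when a word follows it.
    text = re.sub(r"(" + MARKS + r") *(?=\w)", r"\1 ", text)
    return text


print(fix_punctuation_spacing("How are you ?I'm fine"))
print(fix_punctuation_spacing("We went shopping , had lunch and came home"))
```

The dash rule is left out because a hyphen inside a word (as in “well-paid”) must not get spaces, and telling the two apart needs more than a regular expression.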

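2) Plurals

Concepts that don’t exist in your native language are especially hard when you speak a foreign language.

Particles, which English doesn’t have, are still hard for me (especially telling で from に). In English, words that don’t take an s even when they are plural are especially hard. For example, stuff is an uncountable word, but I have seen the mistake *stuffs many times.

3) The definite article “the” and the indefinite article “a”

This is another concept that Japanese doesn’t have, and in English it is very confusing. As a native speaker I judge by ear without thinking about it, but this is how I usually explain it to friends.

  • the - something mentioned before, something the listener already knows, or something there is only one of in the world
  • a - something you don’t or can’t single out

I think this is the hardest of the three. There are lots of exceptions, so you probably can’t master it from theory alone. I think the only way is to listen to a lot of English and consciously note where a and the are used and where they aren’t.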

    • #english
    • #japanese
    • #linguistics
    • #grammar
  • 1 year ago

Korean Pronunciation for British English Speakers

Most online guides and books feature simple introductions to Korean pronunciation based on English words. The trouble with these guides is that they’re written from an American perspective. The same words pronounced by a British English speaker will sound totally different and actually make it harder for you to be understood in Korean.

Before diving into the examples: learning from English examples is OK at first, but when the language you’re learning is significantly different to your own, it’s invaluable to know the International Phonetic Alphabet, or at least the subset of phonemes that occur in your target language. In this case, the Wikipedia page on Korean Phonology is the best place to start. I’ve put the IPA symbols alongside the Korean vowels below, which should help.

Finally, two important points. This guide is based on southern “BBC English”, also known as “Queen’s English”. Also, I am still only a beginner in Korean, so please get a native speaker to check your pronunciation.

Two “A”s and “O”

  • ㅏ ‘cat’ /a/
  • ㅓ ‘cart’ /ɘ/ or /ʌ/
  • ㅗ ‘cot’ /o/

These first three vowels are very close to one another, so it’s best to look at their similarities and differences together.

ㅏ is a harder, harsher sound, as in ‘cat’ or ‘flap’, whereas ㅓ is the rounder sound in ‘cart’ or ‘far’. I found it easy to remember that the right-facing ㅏ is a hard sound as in “attack” and the left-facing ㅓ is the slightly softer of the two.

The difference between ㅏ and ㅓ is present within British English. Consider the difference between northern and southern English pronunciation of ‘grass’ and ‘glass’. In the north they are generally pronounced with a hard ㅏ but in the south they are pronounced with a Korean ㅓ.

One point to note is that in the English examples I gave, the /a/ in ‘cat’ is a shorter sound than the /ɘ/ in ‘cart’. When pronouncing Korean you should make the latter the same length as the short /a/ in ‘cat’.

Finally of the three, Korean ㅗ is pronounced identically to British English “cot”, with a short round o sound. This is different to the longer “oh” or “ou” sound in know/sew.

In addition I’ve been warned about not sticking out my lips far enough when I pronounce ㅗ. It’s easy to drift from ㅗ to ㅓ if you keep your mouth in the same shape, so try to make the ㅗ rounder than the ㅓ.

Two “U” Vowels

  • ㅜ round “oo”, stick out lips /u/
  • ㅡ wide “u”, spread mouth horizontally /ɯ/

The two Korean “u”s are difficult to pronounce and I admit I still cannot use them well in faster conversation.

The former ㅜ is the easier of the two to pronounce. It is close to English “food” but you must stick out your lips slightly more than

The latter ㅡ is worth studying via IPA and asking a Korean native speaker to pronounce for you. Before seeing it “in the flesh” I found the concept very hard to understand. In IPA it’s represented by /ɯ/, which is known as a close back unrounded vowel. The key here is in the name: the tongue sits high and towards the back of the mouth and, most importantly, you do not make the usual round mouth shape for “u” in English.

I remember the difference between the two through their shapes as characters. ㅡ closely resembles a thin wide mouth which corresponds to how it should be pronounced.

Two “E” Vowels

  • ㅔ /e/ Close-mid front unrounded vowel
  • ㅐ /ɛ/ Open-mid front unrounded vowel

The final distinction is arguably the most difficult. I’ve asked many native speakers about this and most say that even those fluent in Korean can’t hear the difference between the two at normal conversational speed. In theory, however, there is a difference.

Their IPA names illustrate how close they really are. So far the simplest explanation I have come across is that ㅔ is harder and ㅐ is softer. My only suggestion is to ask someone to pronounce them for you and watch their mouth.

The Rest

There are a number of diphthongs and other vowels like ㅚ and ㅟ. I might cover those in another post, but if in doubt find a native speaker and get them to pronounce them for you. Good luck!

    • #korean
    • #phonetics
    • #phonology
    • #linguistics
  • 1 year ago

I Quit My Job

Just over a week ago I quit my job.

It’s not quite sunk in yet, but I want to write some reflections on the last 2 years that I spent working there, and what I plan to do next.

It was my first job straight out of university, and I was a programmer at a financial company. I dealt with financial data from various sources, processing it with Perl and importing it into our databases.

Overall the job was extremely Nice. Everyone was a pleasure to work with, the hours were good for Japan, the pay was great, the working environment was OK. The company has a very high retention rate, and I can see how it’s possible to keep working there for a long time.

The work itself was OK too: I was given free rein in how to solve the problems, I tried new techniques like test-driven development, and people were very supportive.

I learned some skills I never thought I’d learn or, to be honest, want to learn. I can now use Vim, and write DCL and Perl way better than I’d ever want to.

So in summary it was all very nice.

However, I came to realise that no matter how good the working conditions (hours, pay, etc.), the thing that I really wanted was to be passionate about my work, to take pride in the things I created.

I wanted to be able to say to anybody “Hell yeah I work at this company, I get to make awesome stuff.” Irrespective of whether they think it’s cool or not, I want to think it’s cool.

It’s cheesy but I’m conscious I only get one go at life. If I’d wanted to stick with nice then I could have stayed in the UK.

Maybe it’s a case of “grass is greener” and there are programmers reading this who make “cool” stuff but would kill for a 9-5 well-paid job. But that’s how I feel at the moment.

So what next?

I said before I wanted to do something that I really care about. When I started learning Korean seriously last year, I looked into a subject called “Linguistics” that I’d heard of but had initially put into the Fluffy Humanities pile, along with English Literature and Biology (take that, biologists!).

The more I learned the more it seemed like actual science, with rules and stuff that even looked like programming.

To cut a long story short, if all goes well, from April next year I’ll be pursuing a PhD in Computational Linguistics here in Japan.

I’ll write more about this in the future. My reasons for doing it aren’t as off-hand as the above, but I don’t want to get into it now.

Until April I have 5 or so months off to take a breather and do the things I enjoy. I came straight from high school to university to a job without a break, so I think a few months off won’t kill me. I’m going to study Korean as much as I can, as well as get back into understanding the mathematics and theory behind computational linguistics and machine learning. Maybe take a road trip or something.

    • #work
    • #programming
    • #linguistics
    • #life
  • 1 year ago

Playing with NLTK

I was playing with the Python NLTK again for the first time in a while and set myself the task of finding all the different ways “to go” is used in English. So I was aiming for a list including stuff like “to go up”, “to go down” etc.

Apologies if the Python code is horrible. I’m still getting used to how to do things in it, so I’m thinking in a mix of Perl/Ruby and still reading the manual a lot.

Code:

import re

import nltk
from nltk.corpus import brown

porter = nltk.PorterStemmer()

# Matches tags that start with a word character, which skips punctuation tags like "." or ","
word_tag = re.compile(r'\w+')

words = []

def process(sentence):
    # Walk over adjacent (word, tag) pairs and keep those whose first word stems to "go"
    for (w1, t1), (w2, t2) in nltk.bigrams(sentence):
        stem1 = porter.stem(w1)
        if stem1 == 'go' and t1.startswith('V') and word_tag.match(t2):
            words.append('to ' + stem1 + ' ' + w2 + ' (' + t1 + ', ' + t2 + ')')

for tagged_sent in brown.tagged_sents():
    process(tagged_sent)

fdist = nltk.FreqDist(words)

# most_common() lists the pairs from most to least frequent
with open('output.txt', 'w') as f:
    for key, val in fdist.most_common():
        f.write("%d : %s\n" % (val, key))

The results are:

202 : to go to (VBG, TO)
94 : to go to (VB, IN)
34 : to go to (VBG, IN)
29 : to go on (VB, RP)
27 : to go on (VBG, RP)
25 : to go out (VB, RP)
24 : to go back (VB, RB)
18 : to go home (VB, NR)
18 : to go into (VB, IN)
18 : to go up (VB, RP)
18 : to go with (VB, IN)
10 : to go down (VB, RP)
9 : to go back (VBG, RB)
8 : to go into (VBG, IN)
8 : to go through (VB, IN)
7 : to go a (VB, AT)
7 : to go and (VB, CC)
7 : to go for (VB, IN)
7 : to go out (VBG, RP)
7 : to go over (VB, RP)
7 : to go to (VB, TO)
...

I could do more to trim down the results, like merging all the different types of verb part-of-speech tags (VB, VBG).
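
As a rough sketch of that merging step, the version below collapses every tag starting with V into a single “V” label before counting. To keep it runnable without downloading a corpus, it uses only the standard library: the hand-written GO_FORMS set stands in for the stemmer, the function name count_go_phrases is made up, and the three sample sentences are invented rather than taken from Brown. Lists of (word, tag) pairs in the same shape as the Brown tagged sentences should work the same way.

```python
from collections import Counter

# Assumption: a hand-written list of the forms of "go", standing in for a stemmer.
GO_FORMS = {"go", "goes", "going", "went", "gone"}


def count_go_phrases(tagged_sents):
    """Count 'to go X' pairs, collapsing every verb tag (VB, VBG, ...) to 'V'."""
    counts = Counter()
    for sent in tagged_sents:
        # zip(sent, sent[1:]) yields each adjacent pair of (word, tag) tuples.
        for (w1, t1), (w2, t2) in zip(sent, sent[1:]):
            is_go_verb = w1.lower() in GO_FORMS and t1.startswith("V")
            # Skip punctuation, whose tags start with a non-alphanumeric character.
            if is_go_verb and t2[:1].isalnum():
                counts["to go %s (V, %s)" % (w2.lower(), t2)] += 1
    return counts


# Invented sample data in the same (word, tag) shape as the Brown corpus.
sample = [
    [("He", "PPS"), ("is", "BEZ"), ("going", "VBG"), ("to", "TO"), ("leave", "VB"), (".", ".")],
    [("They", "PPSS"), ("go", "VB"), ("to", "TO"), ("school", "NN"), (".", ".")],
    [("She", "PPS"), ("went", "VBD"), ("home", "NR"), (".", ".")],
]

for phrase, n in count_go_phrases(sample).most_common():
    print("%d : %s" % (n, phrase))
```

With the verb tags merged, “going to” and “go to” land in the same bucket instead of being split across VB and VBG.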

If you have any comments, I’d be happy to hear them.

    • #programming
    • #nltk
    • #python
    • #linguistics
  • 1 year ago

Playing with Translating

I’ve been reading more about MT, and gave the Excite English translator a go.

In particular, I wanted to see whether it knows what’s animate and what’s inanimate.

There is a rock.

岩石があります。

Knows to use the correct inanimate あります.

There is a man.

男性がいます。

Knows that “man” is animate.

There is a rock man.

ロック男性がいます。

There is a weevil.

ワタノゾウムシがいます。

There is a red-spotted weevil.

赤くぶちのワタノゾウムシがいます。

There is a tyrannosaurus.

ティラノザウルスがいます。

There is a tick.

カチカチする音があります。

Oh dear, fail. It went for the ticking sound rather than the animal.

There is a foo.

fooがあります。

Not sure what a foo is, defaults to inanimate.
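
The behaviour above looks like a lexicon lookup with an inanimate fallback. I have no idea how Excite actually does it, so the Python below is only a toy sketch of that guess: the ANIMATE_NOUNS set and the existence_sentence function are hypothetical names invented for this example.

```python
# Hypothetical toy lexicon; a real MT system would use a far larger resource.
ANIMATE_NOUNS = {"男性", "ワタノゾウムシ", "ティラノザウルス"}


def existence_sentence(noun):
    """Build 'There is a <noun>.' in Japanese, choosing the verb by animacy."""
    # Nouns missing from the lexicon fall back to the inanimate verb.
    verb = "います" if noun in ANIMATE_NOUNS else "あります"
    return noun + "が" + verb + "。"


for noun in ["岩石", "男性", "foo"]:
    print(existence_sentence(noun))
```

A lookup like this would also explain the tick result: once the wrong sense of the word is picked, the verb follows from it.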

As a bonus:

There is a walking man.

通行人がいます。

There is a man walking.

歩いている男性がいます。

    • #language
    • #linguistics
    • #mt
    • #japanese
    • #english
  • 1 year ago

About

Computational linguistics researcher at Kyoto University, focussing on machine translation. Also learning Japanese, Korean, French and other badassery.

Me, Elsewhere

  • @benhumphreys on Twitter
  • benhumphreys on github