Tencent ("Goose Factory") interview question: an algorithm for spell-checking English words? - 招聘街

I. Tencent ("Goose Factory") interview question: an algorithm for spell-checking English words?

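Time to plug Python again: the final code is under 30 lines (before optimization) and no more than 40 with the optimizations.

Step 1. Build the Trie (a dict both records node information and holds each node's children):

-- Idea: for every word in the dictionary, extend the Trie letter by letter; the node where a word ends is marked with a None key.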

def make_trie(words):
    trie = {}
    for word in words:
        t = trie
        for c in word:
            if c not in t: t[c] = {}
            t = t[c]
        t[None] = None
    return trie
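
As a minimal sketch of how such a dict-based Trie can be queried before any error tolerance is added, the snippet below rebuilds the same structure and performs an exact lookup. The helper name contains is hypothetical and not part of the original answer.

```python
def make_trie(words):
    """Build a dict-based trie; a None key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = None
    return trie


def contains(trie, word):
    """Hypothetical helper: exact lookup with no error tolerance."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    # Only a node carrying the None key ends a complete word.
    return None in node


trie = make_trie(['hello', 'hela', 'dome'])
print(contains(trie, 'hela'))  # True
print(contains(trie, 'hel'))   # False: a prefix, not a complete word
```

The fault-tolerant search can be read as this walk plus extra branches that spend tolerance when a letter does not match.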

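Step 2. Fault-tolerant lookup (with error tolerance tol):

-- Idea: this is essentially a depth-first search of the Trie in which each step deeper consumes one letter of the target word. On reaching a node, the search splits into the case that spends no tolerance (the letter matches) and the case that spends one unit of tolerance (the letter is substituted), and continues until the target word is empty. Along the way, path records the search path; when the search stops on an end-of-word node, that path is a word in the dictionary and serves as a correction candidate.

-- The final result is the union of the paths of all nodes where the search stops.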

def check_fuzzy(trie, word, path='', tol=1):
    if word == '':
        return {path} if None in trie else set()
    else:
        p0 = set()
        if word[0] in trie:
            p0 = check_fuzzy(trie[word[0]], word[1:], path+word[0], tol)
        p1 = set()
        if tol > 0:
            for k in trie:
                if k is not None and k != word[0]:
                    p1.update(check_fuzzy(trie[k], word[1:], path+k, tol-1))
        return p0 | p1

Simple test code ------

Build the Trie:

words = ['hello', 'hela', 'dome']
t = make_trie(words)

In [11]: t
Out[11]: 
{'d': {'o': {'m': {'e': {None: None}}}},
 'h': {'e': {'l': {'a': {None: None}, 'l': {'o': {None: None}}}}}}

Fault-tolerant lookup:

In [50]: check_fuzzy(t, 'hellu', tol=0)
Out[50]: set()

In [51]: check_fuzzy(t, 'hellu', tol=1)
Out[51]: {'hello'}

In [52]: check_fuzzy(t, 'healu', tol=1)
Out[52]: set()

In [53]: check_fuzzy(t, 'healu', tol=2)
Out[53]: {'hello'}

Seems to work~

--------------------------- divider --------------------------------------

The above is a Trie-based approach. For another approach, see "How to Write a Spelling Corrector" by Peter Norvig, recommended by @黃振.
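
For contrast with the Trie search, here is a minimal sketch in the spirit of the candidate-generation step described in that essay: generate every string one edit away from the input, then keep those that are in the vocabulary. It is a paraphrase written for this page, not Norvig's code; the function candidates and the three-word vocabulary are assumptions for illustration.

```python
import string


def edits1(word, letters=string.ascii_lowercase):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def candidates(word, vocabulary):
    """Hypothetical helper: vocabulary words within one edit of word."""
    return edits1(word) & set(vocabulary)


vocabulary = ['hello', 'hela', 'dome']
print(candidates('hellu', vocabulary))  # {'hello'}
```

This enumerates edits of the input and filters by the vocabulary, whereas the Trie search only ever walks paths that exist in the vocabulary.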

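Although I have been imitating Norvig's coding style, consciously or not, I am still instantly humbled every time I read his source code...

By the way, the expression word[1:] has a lineage: I trust some readers know (cdr word) by heart... (heh

------------------------ divider two --------------------------------------

Back to the topic... Some readers asked whether more kinds of error could be tolerated, such as inserted or deleted letters. I roughly extended the v2 method and got the v3 version below.

The key to the extension is the termination of the recursion: every recursive call must effectively shrink one of its arguments, either word or tol~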

def check_fuzzy(trie, word, path='', tol=1):
    if tol < 0:
        return set()
    elif word == '':
        results = set()
        if None in trie:
            results.add(path)
        # Append letters missing from the end of the word
        for k in trie:
            if k is not None:
                results |= check_fuzzy(trie[k], '', path+k, tol-1)
        return results
    else:
        results = set()
        # The first letter matches
        if word[0] in trie:
            results |= check_fuzzy(trie[word[0]], word[1:], path + word[0], tol)
        # Keep searching case by case (in effect, keep the backtracking branches still to explore)
        for k in trie:
            if k is not None and k != word[0]:
                # Substitute a possibly correct letter for the first letter
                results |= check_fuzzy(trie[k], word[1:], path+k, tol-1)
                # Insert a possibly correct letter before the first letter
                results |= check_fuzzy(trie[k], word, path+k, tol-1)
        # Skip the first letter of the remaining word
        results |= check_fuzzy(trie, word[1:], path, tol-1)
        # Swap the first two letters of the remaining word
        if len(word) > 1:
            results |= check_fuzzy(trie, word[1]+word[0]+word[2:], path, tol-1)
        return results

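Looks like it is still under 30 lines... comments don't count (

The algorithm in this answer only aims at the most concise expression possible and outlines the general idea. Real-world use would probably need a lot of adaptation and tuning, including correction biases obtained from statistics and learning. My guess is that such extensions can be reflected in how the Trie nodes are built, for example by attaching a probability to each node and letting it steer the search; they may also show up as more control parameters for the search branches, for example more imaginative branches. (I won't go into finer details here.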
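
One way to read "attach a value to the node" is to store a word count under the None key instead of None itself, then rank candidates by that count. The sketch below is only an illustration under that assumption; make_weighted_trie, frequency, rank and the toy corpus are hypothetical names and data, not part of the original answer.

```python
from collections import Counter


def make_weighted_trie(word_counts):
    """Like make_trie, but the None key stores the word's count."""
    trie = {}
    for word, n in word_counts.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = n
    return trie


def frequency(trie, word):
    """Return the stored count of word, or 0 if it is not a complete word."""
    node = trie
    for ch in word:
        node = node.get(ch)
        if node is None:
            return 0
    return node.get(None, 0)


def rank(trie, candidates):
    """Sort candidates by descending count, then alphabetically."""
    return sorted(candidates, key=lambda w: (-frequency(trie, w), w))


# Toy corpus (made up for the example): hello x3, dome x2, hela x1.
counts = Counter(['hello', 'hello', 'hello', 'dome', 'dome', 'hela'])
weighted = make_weighted_trie(counts)
print(rank(weighted, {'hela', 'hello'}))  # ['hello', 'hela']
```

Because the search functions only test whether the None key is present, they keep working unchanged on a Trie built this way.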

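---------------------------------- divider three ----------------------------------------

Readers may care about time and space complexity, because the "elegant" (read: brute-force) style above makes the number of set objects grow exponentially, makes the time spent merging sets grow exponentially too, and overloads the gc. Besides, we do not want the search to enumerate every result at once (that costs too much time); we may need only the first three results, then three more if those are not satisfactory, and so on...

So what to do? Time to bring out the little yield wand.

The version below, tentatively called lazy, looks a lot like v3 (in fact the two are almost semantically equivalent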

def check_lazy(trie, word, path='', tol=1):
    if tol < 0:
        pass
    elif word == '':
        if None in trie:
            yield path
        # Append letters missing from the end of the word
        for k in trie:
            if k is not None:
                yield from check_lazy(trie[k], '', path + k, tol - 1)
    else:
        if word[0] in trie:
            # The first letter matches
            yield from check_lazy(trie[word[0]], word[1:], path+word[0], tol)
        # Keep searching case by case (in effect, keep the backtracking branches still to explore)
        for k in trie:
            if k is not None and k != word[0]:
                # Substitute a possibly correct letter for the first letter
                yield from check_lazy(trie[k], word[1:], path+k, tol-1)
                # Insert a possibly correct letter before the first letter
                yield from check_lazy(trie[k], word, path+k, tol-1)
        # Skip the first letter of the remaining word
        yield from check_lazy(trie, word[1:], path, tol-1)
        # Swap the first two letters of the remaining word
        if len(word) > 1:
            yield from check_lazy(trie, word[1]+word[0]+word[2:], path, tol-1)

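Without any container object, we stitch recursive sub-sequences into a single sequence in an almost declarative way.

[Note for beginners] What does yield mean? The function pauses there and hands you a result; when you call next, it resumes from where it paused and runs until it has the next result, then pauses again. To understand yield, you first have to understand yield... no, no: you first have to understand the iter and next functions, then really understand the for loop; see the official documentation for details. For plain iteration, yield from x is equivalent to for y in x: yield y.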
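
A small sketch of that equivalence, using two hypothetical generator functions that chain several iterables, one with an explicit inner loop and one with yield from:

```python
def chain_explicit(*iterables):
    """Chain iterables with an explicit inner for loop."""
    for it in iterables:
        for item in it:
            yield item


def chain_delegating(*iterables):
    """Same behaviour for plain iteration, written with yield from."""
    for it in iterables:
        yield from it


gen = chain_delegating('ab', [1, 2])
print(next(gen))  # a
print(list(gen))  # ['b', 1, 2]
```

Nothing runs until next is called; the list call then drains whatever is left.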

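Here is a small primer for readers who have just met yield; along the way, recall the definition of the binomial coefficient C(n, m), namely

C(n, m) = C(n-1, m-1) + C(n-1, m)

If we read C as the set determined by n and m, and the plus sign as set union, the following generator lets us lazily obtain all the combinations one at a time: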

def combinations(seq, m):
    if m > len(seq):
        raise ValueError('Cannot choose more than sequence has.')
    elif m == 0:
        yield ()
    elif m == len(seq):
        yield tuple(seq)
    else:
        for p in combinations(seq[1:], m-1):
            yield (seq[0],) + p
        yield from combinations(seq[1:], m)

for combi in combinations('abcde', 2): 
    print(combi)

As you can see, the structure of the generator precisely mirrors the set operations and also carries the logic for mapping over elements, so it is very readable.
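
As a quick cross-check of the recurrence, the sketch below splits the 2-element combinations of 'abcde' into those that contain the first element and those that do not, using the standard library's itertools.combinations and math.comb (Python 3.8+) rather than the generator above. The variable names are illustrative.

```python
import itertools
from math import comb

seq = 'abcde'
m = 2

# Combinations that contain seq[0]: choose m-1 from the rest, prepend seq[0].
with_first = [(seq[0],) + rest
              for rest in itertools.combinations(seq[1:], m - 1)]
# Combinations that do not contain seq[0]: choose m from the rest.
without_first = list(itertools.combinations(seq[1:], m))

# C(5, 2) = C(4, 1) + C(4, 2)
print(len(with_first), len(without_first), comb(len(seq), m))  # 4 6 10
```

The two branches correspond to the C(n-1, m-1) and C(n-1, m) terms respectively.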

OK, that is all the code. With the next function, we can fetch lookup results lazily.

In [54]: words = ['hell', 'hello', 'hela', 'helmut', 'dome']

In [55]: t = make_trie(words)

In [57]: c = check_lazy(t, 'hell')

In [58]: next(c)
Out[58]: 'hell'

In [59]: next(c)
Out[59]: 'hello'

In [60]: next(c)
Out[60]: 'hela'

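That said, one problem with lazy is that we cannot predict and drop duplicate elements ahead of time. A handy tool here is a decorator that wraps a generator and guarantees that no result is repeated.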

from functools import wraps

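def uniq(func):
    @wraps(func)
    def _func(*a, **kw):
        seen = set()
        # Iterate with for: since Python 3.7 (PEP 479), a StopIteration that
        # leaks from next() inside a generator is turned into a RuntimeError.
        for x in func(*a, **kw):
            if x not in seen:
                seen.add(x)
                yield x
    return _func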
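
To illustrate the "first three results, then three more" usage mentioned earlier, here is a minimal sketch that combines a de-duplicating decorator with itertools.islice. The generator noisy_candidates is a made-up stand-in for check_lazy so that the example runs on its own.

```python
from functools import wraps
from itertools import islice


def uniq(func):
    """Wrap a generator function so that each value is yielded only once."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        seen = set()
        for item in func(*args, **kwargs):
            if item not in seen:
                seen.add(item)
                yield item
    return wrapper


@uniq
def noisy_candidates():
    # Stand-in for check_lazy: yields some words more than once.
    yield from ['hell', 'hello', 'hell', 'hela', 'hello', 'helmut']


results = noisy_candidates()
first_three = list(islice(results, 3))  # consume only three results
next_three = list(islice(results, 3))   # resume where the search left off
print(first_three)  # ['hell', 'hello', 'hela']
print(next_three)   # ['helmut']
```

The second islice call continues from the paused generator, so work already done for the first batch is not repeated.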

The file at this URL contains common English words and can be used to test the code:

In [10]: import urllib.request

In [11]: f = urllib.request.urlopen("https://raw.githubusercontent.com/eneko/data-repository/master/data/words.txt")

# Strip the newline characters
In [12]: t = make_trie(line.decode().strip() for line in f.readlines())

In [13]: f.close()

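---------------------- divider four -----------------------------

Last of all: recursion is expensive in Python, but its strength lies in describing the problem. To chase maximum performance, we can turn the recursion into iteration and fold the de-duplication logic right in, which gives this v4 version: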

from collections import deque

def check_iter(trie, word, tol=1):
    seen = set()
    q = deque([(trie, word, '', tol)])
    while q:
        trie, word, path, tol = q.popleft()
        if word == '':
            if None in trie:
                if path not in seen:
                    seen.add(path)
                    yield path
            if tol > 0:
                for k in trie:
                    if k is not None:
                        q.appendleft((trie[k], '', path+k, tol-1))
        else:
            if word[0] in trie:
                q.appendleft((trie[word[0]], word[1:], path+word[0], tol))
            if tol > 0:
                for k in trie.keys():
                    if k is not None and k != word[0]:
                        q.append((trie[k], word[1:], path+k, tol-1))
                        q.append((trie[k], word, path+k, tol-1))
                q.append((trie, word[1:], path, tol-1))
                if len(word) > 1:
                    q.append((trie, word[1]+word[0]+word[2:], path, tol-1))

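As you can see, after switching to iteration we can still largely keep the shape of the recursive program, while gaining flexibility (with recursion, it is as if q could only be a stack). Building on this iterative structure, if you have word-frequency data you can use it to maintain q as a priority heap, or even a dynamic heap whose frequencies adapt to the context, keeping high-frequency words at the top and saving a good deal of work during correction. I won't go deeper here.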
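
A minimal sketch of the heap idea follows. Since no frequency data is given here, the priority is the number of edits spent so far, a substitute chosen purely for illustration, so candidates needing fewer edits come out first; a frequency-based score could be used as the priority instead. The name check_best_first is hypothetical.

```python
import heapq
from itertools import count


def make_trie(words):
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = None
    return trie


def check_best_first(trie, word, tol=1):
    """Yield distinct candidates, those needing fewer edits first."""
    seen = set()
    tie = count()  # tie-breaker so the heap never compares dict nodes
    heap = []

    def push(cost, node, rest, path):
        if cost <= tol:
            heapq.heappush(heap, (cost, next(tie), node, rest, path))

    push(0, trie, word, '')
    while heap:
        cost, _, node, rest, path = heapq.heappop(heap)
        if rest == '':
            if None in node and path not in seen:
                seen.add(path)
                yield path
            for k in node:
                if k is not None:
                    push(cost + 1, node[k], '', path + k)  # append a letter
        else:
            if rest[0] in node:  # match: costs nothing
                push(cost, node[rest[0]], rest[1:], path + rest[0])
            for k in node:
                if k is not None and k != rest[0]:
                    push(cost + 1, node[k], rest[1:], path + k)  # substitute
                    push(cost + 1, node[k], rest, path + k)      # insert
            push(cost + 1, node, rest[1:], path)                 # delete
            if len(rest) > 1:                                    # swap
                push(cost + 1, node, rest[1] + rest[0] + rest[2:], path)


t = make_trie(['hell', 'hello', 'hela', 'helmut', 'dome'])
print(next(check_best_first(t, 'hell', tol=1)))  # hell
```

The exact match comes out first because it is the only state that reaches the end of the word at cost 0.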

[Optional step] When correcting a word we tend to assume that its first letter is right; exploiting this relieves a lot of search pressure and can cut the time several-fold.

def check_head_fixed(trie, word, tol=1):
    for p in check_lazy(trie[word[0]], word[1:], tol=tol):
        yield word[0] + p

Finally, a quick benchmark:

In [18]: list(check_head_fixed(trie, 'misella', tol=2))
Out[18]:
['micellar',
 'malella',
 'mesilla',
 'morella',
 'mysell',
 'micelle',
 'milla',
 'misally',
 'mistell',
 'miserly']

In [19]: %timeit list(check_head_fixed(trie, 'misella', tol=2))
1.52 ms ± 2.84 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

On an i7 running Win10, all results come back within about two milliseconds, which is quite satisfactory.

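II. Is it OK to inspect bees every day?

Bees are among the most important insects in our lives. They not only gather nectar and make honey but also pollinate flowers and help plants reproduce. Bee society is highly organized with a clear division of labor: each bee takes on specific tasks, one of which is the daily inspection work.

Why do bees carry out inspections every day?

Daily inspections serve important purposes. First, bees need to make sure that the conditions and resources inside the hive are sufficient for all its members to survive and reproduce. They check whether environmental factors such as temperature and humidity are suitable, and whether the food supply in the hive is adequate. Second, bees need to keep the hive safe and free from disease.

What bees check each day

The daily checks mainly cover the following:

Nest inspection: bees check whether the structure of the nest is intact and repair any damage. A sound, complete structure is vital to the running of the whole hive.
Food supply inspection: bees check whether enough nectar is stored and gradually turn the nectar into honey. If food runs short, they promptly look for new flowers to collect nectar from.
Disease and pest inspection: bees check the hive for diseases and pests and take measures to control them, keeping the whole colony healthy.
Drone and female bee inspection: bees check whether the hive needs to raise new bees and regulate the ratio of drones to females to keep the hive in balance.
Honey quality inspection: bees check the quality of the honey and make sure the honey in the hive is dry and up to standard.

Why the inspections matter

The daily inspection work is vital to the running of the whole hive. First, nest inspection keeps the hive structurally sound and prevents damage and collapse, protecting the bees' lives. Second, food supply inspection ensures that the members of the hive have enough to eat and do not starve because of a shortage. Meanwhile, disease and pest inspection allows problems to be found and controlled in time, keeping the whole colony healthy and able to reproduce.

In addition, the daily inspections help maintain the ecological balance and social order inside the hive. By regulating the ratio of drones to females, bees can control the size and structure of the population to suit different environmental conditions. Honey quality inspection safeguards the nutritional value and quality of the honey in the hive, providing the bees with ample energy and nutrition.

Conclusion

Daily inspection is an indispensable part of bees' lives. It keeps the hive orderly and stable and safeguards the survival and reproduction of the whole colony. It is thanks to the bees' quiet daily effort that we have the delicious honey we eat. Their hard work deserves our respect, and we should do more to protect and look after the environment bees live in, guarding together an ecosystem that is closely tied to humanity.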

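III. How do you check an air conditioner that is not cooling?

1. First check that the air conditioner is running normally, both the indoor unit and the outdoor unit, and especially the compressor in the outdoor unit. If either unit is not running properly, call a technician to repair it.

2. Next check the filter. If the unit starts normally but no cold air comes out, the filter is probably clogged with dust; it is best to spray it with a dedicated filter cleaner and wash it.

3. Then check whether the refrigerant (commonly called Freon) is running low; if so, it needs topping up. Also check the refrigeration circuit for leaks, and if there is a leak, it has to be sealed.

4. If none of the above applies and the unit runs normally, check the fan. If the airflow is very weak, the fan motor may be worn out and need replacing.

5. If the unit's vent pipe is installed with a long run, heat may be carried from indoors to outdoors less effectively, so the cooling is not very noticeable. It is best to reinstall the unit in a better position and keep the pipe as short as possible.

6. If the unit runs normally and everything else is in order, but the room does not feel cool and cold air can only be felt right at the outlet, the unit's capacity (horsepower) is probably too small for the size of the room; in that case it is best to switch to a higher-capacity unit.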

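IV. How do you check why an air conditioner is not cooling?

Ways to check an air conditioner that is not cooling:

Check whether the heat exchanger needs cleaning. An air conditioner relies on heat exchangers to move heat between indoors and outdoors. When they are covered in dust, cobwebs, and other debris, heat cannot be dissipated effectively and cooling suffers, so the heat exchangers of both the indoor and outdoor units need cleaning.

Check whether the voltage is stable. At peak hours the supply voltage can be unstable and fall short of what the unit needs to run, so it cools poorly or not at all; this is more likely in hot weather when many air conditioners run at once. An individual cannot fix this problem, but a variable-frequency (inverter) air conditioner can be chosen instead.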

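V. Should a stomach examination be done on an empty stomach?

Stomach examinations require fasting. Commonly used methods today include the carbon-13 breath test, the carbon-14 breath test, the upper gastrointestinal barium meal, gastroscopy, and gastric function tests. All of these must be done after fasting from food and water. Otherwise, on the one hand, the accuracy of the test is affected, the results are distorted, and false positives can occur. On the other hand, it interferes with observing the gastric mucosa and pathological changes, and in the event of nausea and vomiting the stomach contents can flow back into the esophagus, leading to choking or lung infection.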

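VI. Can the vehicle inspection compliance sticker be left off?

Since March 1, three new pilot measures have been in force. One of them is the digitization of the paper vehicle inspection sticker; follow the 元貝駕考 editor for a look!

1. Which cities have electronic vehicle inspection stickers?

Pilot cities: Beijing, Tianjin, Shanghai, Chongqing, Harbin, Nanjing, Hangzhou, Ningbo, Jinan, Zhuzhou, Shenzhen, Haikou, Chengdu, Guiyang, Yuxi, and Urumqi, 16 cities in all.

How to obtain it: drivers can obtain it through the Internet Traffic Safety Comprehensive Service Platform or by logging in to the "交管12123" mobile app. Vehicles whose electronic inspection certificate has been obtained no longer need a paper sticker on the window.

2. How are special cases handled?

Vehicles undergoing registration, transfer, or change of registration: once the registration is completed and the paper inspection sticker is issued, the system automatically generates the electronic certificate, which the driver can obtain and view as described above.

Vehicles exempt from inspection for six years: the owner can obtain and view the certificate directly as described above. If a paper sticker is needed, it can be collected in person at the vehicle management office or received by mail.

Vehicles inspected online: once the vehicle passes inspection, the owner receives the paper sticker and can also view and download the electronic certificate online. Vehicles still within the inspection validity period can obtain and view it directly as described above.

Vehicles whose paper sticker is lost or destroyed: there is no need to apply for a replacement sticker; the certificate can be obtained and viewed directly as described above.

3. How is the electronic certificate presented during a check?

Ways to present the electronic inspection certificate: online, offline, and printed.

Online: the owner logs in to the "交管12123" mobile app and shows the electronic inspection certificate online.

Offline: log in to the Internet Traffic Safety Comprehensive Service Platform or the "交管12123" mobile app in advance, save the electronic certificate to the phone, and show it when needed.

Printed: the saved electronic certificate can be printed out directly and shown when needed.

The 元貝 editor reminds all drivers that the electronic inspection certificate has the same legal force as the paper one; present it proactively when checked.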

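VII. Mahout interview questions?

I had previously gone through how the official Mahout example 20news is invoked, so I wanted to implement another example following the same flow. Online I found one about whether the weather is suitable for playing badminton.

Training data: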

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Data to classify:

sunny, hot, high, weak

Result:

Yes => 0.007039

No => 0.027418
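
For intuition about where scores of this size come from, here is a minimal hand-rolled naive Bayes sketch over the 14 training rows. It uses plain unsmoothed frequency estimates (prior times per-attribute likelihoods), which is an assumption made for illustration and not the TF-IDF-weighted model Mahout trains. Its output comes out close to the figures quoted above but need not match them exactly. The names DATA and naive_bayes_scores are hypothetical.

```python
from collections import Counter

# (Outlook, Temperature, Humidity, Wind, PlayTennis) for D1..D14
DATA = [
    ('Sunny', 'Hot', 'High', 'Weak', 'No'),
    ('Sunny', 'Hot', 'High', 'Strong', 'No'),
    ('Overcast', 'Hot', 'High', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'High', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'),
    ('Sunny', 'Mild', 'High', 'Weak', 'No'),
    ('Sunny', 'Cool', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'Normal', 'Weak', 'Yes'),
    ('Sunny', 'Mild', 'Normal', 'Strong', 'Yes'),
    ('Overcast', 'Mild', 'High', 'Strong', 'Yes'),
    ('Overcast', 'Hot', 'Normal', 'Weak', 'Yes'),
    ('Rain', 'Mild', 'High', 'Strong', 'No'),
]


def naive_bayes_scores(rows, query):
    """Return P(label) * prod_i P(query[i] | label) for every label."""
    label_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in label_counts.items():
        score = n_label / len(rows)  # prior P(label)
        for i, value in enumerate(query):
            hits = sum(1 for row in rows
                       if row[-1] == label and row[i] == value)
            score *= hits / n_label  # likelihood P(value | label)
        scores[label] = score
    return scores


scores = naive_bayes_scores(DATA, ('Sunny', 'Hot', 'High', 'Weak'))
for label, score in sorted(scores.items()):
    print(label, round(score, 6))
```

With these estimates the No score is the larger of the two, which agrees with the ordering of the quoted results.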

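So I used Java code to call Mahout's utility classes to do the classification.

Basic idea:

1. Build the classification data.

2. Train with Mahout's utility classes to obtain a trained model.

3. Convert the data to be classified into vector data.

4. Have the classifier classify the vector data.

Here is my implementation =>

1. Build the classification data:

On HDFS, create the directory path /zhoujianfeng/playtennis/input and upload the data of the category folders no and yes to HDFS.

Data file format, e.g. the contents of file D1: Sunny Hot High Weak

2. Train with Mahout's utility classes to obtain a trained model.

3. Convert the data to be classified into vector data.

4. Have the classifier classify the vector data.

For these three steps I will post all the code at once; there are two main classes, PlayTennis1 and BayesCheckData =>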

package myTesting.bayes;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.FileSystem;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.util.ToolRunner;

import org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob;

import org.apache.mahout.text.SequenceFilesFromDirectory;

import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class PlayTennis1 {

private static final String WORK_DIR = "hdfs://192.168.9.72:9000/zhoujianfeng/playtennis";

// Test code

public static void main(String[] args) {

// Convert the training data into vector data

makeTrainVector();

// Build the training model

makeModel(false);

// Classify the test data

BayesCheckData.printResult();

}

public static void makeCheckVector(){

// Convert the test data into sequence files

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"testinput";

String output = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

// the boolean argument means: delete recursively

fs.delete(out, true);

}

SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();

String[] params = new String[]{"-i",input,"-o",output,"-ow"};

ToolRunner.run(sffd, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("File serialization failed!");

System.exit(1);

}

// Convert the sequence files into vector files

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-test-seq";

String output = WORK_DIR+Path.SEPARATOR+"tennis-test-vectors";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

// the boolean argument means: delete recursively

fs.delete(out, true);

}

SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();

String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};

ToolRunner.run(svfsf, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("Failed to convert the sequence files into vectors!");

System.exit(2);

}

}

public static void makeTrainVector(){

// Convert the training data into sequence files

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"input";

String output = WORK_DIR+Path.SEPARATOR+"tennis-seq";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

// the boolean argument means: delete recursively

fs.delete(out, true);

}

SequenceFilesFromDirectory sffd = new SequenceFilesFromDirectory();

String[] params = new String[]{"-i",input,"-o",output,"-ow"};

ToolRunner.run(sffd, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("File serialization failed!");

System.exit(1);

}

// Convert the sequence files into vector files

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-seq";

String output = WORK_DIR+Path.SEPARATOR+"tennis-vectors";

Path in = new Path(input);

Path out = new Path(output);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

// the boolean argument means: delete recursively

fs.delete(out, true);

}

SparseVectorsFromSequenceFiles svfsf = new SparseVectorsFromSequenceFiles();

String[] params = new String[]{"-i",input,"-o",output,"-lnorm","-nv","-wt","tfidf"};

ToolRunner.run(svfsf, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("Failed to convert the sequence files into vectors!");

System.exit(2);

}

}

public static void makeModel(boolean completelyNB){

try {

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String input = WORK_DIR+Path.SEPARATOR+"tennis-vectors"+Path.SEPARATOR+"tfidf-vectors";

String model = WORK_DIR+Path.SEPARATOR+"model";

String labelindex = WORK_DIR+Path.SEPARATOR+"labelindex";

Path in = new Path(input);

Path out = new Path(model);

Path label = new Path(labelindex);

FileSystem fs = FileSystem.get(conf);

if(fs.exists(in)){

if(fs.exists(out)){

// the boolean argument means: delete recursively

fs.delete(out, true);

}

if(fs.exists(label)){

// the boolean argument means: delete recursively

fs.delete(label, true);

}

TrainNaiveBayesJob tnbj = new TrainNaiveBayesJob();

String[] params =null;

if(completelyNB){

params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow","-c"};

}else{

params = new String[]{"-i",input,"-el","-o",model,"-li",labelindex,"-ow"};

}

ToolRunner.run(tnbj, params);

}

} catch (Exception e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("Failed to build the training model!");

System.exit(3);

}

}

}

package myTesting.bayes;

import java.io.IOException;

import java.util.HashMap;

import java.util.Map;

import org.apache.commons.lang.StringUtils;

import org.apache.hadoop.conf.Configuration;

import org.apache.hadoop.fs.Path;

import org.apache.hadoop.fs.PathFilter;

import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.LongWritable;

import org.apache.hadoop.io.Text;

import org.apache.mahout.classifier.naivebayes.BayesUtils;

import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;

import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;

import org.apache.mahout.common.Pair;

import org.apache.mahout.common.iterator.sequencefile.PathType;

import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;

import org.apache.mahout.math.RandomAccessSparseVector;

import org.apache.mahout.math.Vector;

import org.apache.mahout.math.Vector.Element;

import org.apache.mahout.vectorizer.TFIDF;

import com.google.common.collect.ConcurrentHashMultiset;

import com.google.common.collect.Multiset;

public class BayesCheckData {

private static StandardNaiveBayesClassifier classifier;

private static Map<String, Integer> dictionary;

private static Map<Integer, Long> documentFrequency;

private static Map<Integer, String> labelIndex;

public void init(Configuration conf){

try {

String modelPath = "/zhoujianfeng/playtennis/model";

String dictionaryPath = "/zhoujianfeng/playtennis/tennis-vectors/dictionary.file-0";

String documentFrequencyPath = "/zhoujianfeng/playtennis/tennis-vectors/df-count";

String labelIndexPath = "/zhoujianfeng/playtennis/labelindex";

dictionary = readDictionnary(conf, new Path(dictionaryPath));

documentFrequency = readDocumentFrequency(conf, new Path(documentFrequencyPath));

labelIndex = BayesUtils.readLabelIndex(conf, new Path(labelIndexPath));

NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);

classifier = new StandardNaiveBayesClassifier(model);

} catch (IOException e) {

// TODO Auto-generated catch block

e.printStackTrace();

System.out.println("Error while initializing the conversion of the test data into vectors...");

System.exit(4);

}

}

/**

* Load the dictionary file. Key: TermValue; Value: TermID

* @param conf

* @param dictionnaryDir

* @return

*/

private static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryDir) {

Map<String, Integer> dictionnary = new HashMap<String, Integer>();

PathFilter filter = new PathFilter() {

@Override

public boolean accept(Path path) {

String name = path.getName();

return name.startsWith("dictionary.file");

}

};

for (Pair<Text, IntWritable> pair : new SequenceFileDirIterable<Text, IntWritable>(dictionnaryDir, PathType.LIST, filter, conf)) {

dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());

}

return dictionnary;

}

/**

* Load the term document-frequency files under the df-count directory. Key: TermID; Value: DocFreq

* @param conf

* @param documentFrequencyDir

* @return

*/

private static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyDir) {

Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();

PathFilter filter = new PathFilter() {

@Override

public boolean accept(Path path) {

return path.getName().startsWith("part-r");

}

};

for (Pair<IntWritable, LongWritable> pair : new SequenceFileDirIterable<IntWritable, LongWritable>(documentFrequencyDir, PathType.LIST, filter, conf)) {

documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());

}

return documentFrequency;

}

public static String getCheckResult(){

Configuration conf = new Configuration();

conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));

String classify = "NaN";

BayesCheckData cdv = new BayesCheckData();

cdv.init(conf);

System.out.println("init done...............");

Vector vector = new RandomAccessSparseVector(10000);

TFIDF tfidf = new TFIDF();

//sunny，hot，high，weak

Multiset<String> words = ConcurrentHashMultiset.create();

words.add("sunny",1);

words.add("hot",1);

words.add("high",1);

words.add("weak",1);

int documentCount = documentFrequency.get(-1).intValue(); // key = -1 holds the total number of documents

for (Multiset.Entry<String> entry : words.entrySet()) {

String word = entry.getElement();

int count = entry.getCount();

Integer wordId = dictionary.get(word); // look up the wordID from dictionary.file-0 (under the tf-vector output)

if (wordId == null){ // word not in the dictionary: skip it instead of hitting a NullPointerException

continue;

}

if (documentFrequency.get(wordId) == null){

continue;

}

Long freq = documentFrequency.get(wordId);

double tfIdfValue = tfidf.calculate(count, freq.intValue(), 1, documentCount);

vector.setQuick(wordId, tfIdfValue);

}

// Classify with the Bayes algorithm and pick the label with the best score

Vector resultVector = classifier.classifyFull(vector);

double bestScore = -Double.MAX_VALUE;

int bestCategoryId = -1;

for(Element element: resultVector.all()) {

int categoryId = element.index();

double score = element.get();

System.out.println("categoryId:"+categoryId+" score:"+score);

if (score > bestScore) {

bestScore = score;

bestCategoryId = categoryId;

}

}

classify = labelIndex.get(bestCategoryId)+"(categoryId="+bestCategoryId+")";

return classify;

}

public static void printResult(){

System.out.println("The predicted category is: "+getCheckResult());

}

}

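VIII. WebGIS interview questions?

1. Please introduce the concept and role of WebGIS, along with its advantages and challenges in practice.

WebGIS is a geographic information system built on Web technology. By presenting geographic data and functionality visually in a Web browser, it enables geospatial data to be shared and analyzed. It can be used for map browsing, spatial queries, geographic analysis, and many other scenarios. Its advantages include easy access, cross-platform support, real-time updates, and strong customizability, but it also faces challenges in data security, performance optimization, and user experience.

2. Please talk about your experience and skills in WebGIS development.

I have extensive experience and skills in WebGIS development. I am familiar with common WebGIS frameworks and tools such as ArcGIS API for JavaScript, Leaflet, and OpenLayers. I can use front-end technologies such as HTML, CSS, and JavaScript for map display and interaction design, and back-end technologies such as Python and Java for geographic data processing and analysis. I also have skills in database management and geospatial data modeling, and can design and optimize the architecture of a WebGIS system.

3. Please describe specific problems you solved with WebGIS in past projects and the results you achieved.

In past projects I have used WebGIS to solve many specific problems with notable results. For example, in an urban planning project I developed a WebGIS-based traffic flow analysis system that helped planners evaluate the effect of different traffic schemes. In an environmental monitoring project I used WebGIS to build a real-time air quality monitoring and early-warning system that provided accurate air quality data and visual analysis, helping the government and the public make decisions.

4. Please share your views on and expectations for the future of WebGIS.

I believe WebGIS will keep growing. As cloud computing, big data, and artificial intelligence advance, WebGIS will be able to handle geographic data at larger scale, offer richer geographic analysis, and integrate deeply with technologies from other fields. I expect future WebGIS to be more intelligent and more personalized, giving users better geographic information services and supporting decisions and development across industries.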